Abstract


Effective thread management is crucial to achieving good performance on large-scale distributed-memory 
multiprocessors that support dynamic threads. For a given parallel computation with some associated task 
graph, a thread-management algorithm produces a running schedule as output, subject to the precedence 
constraints imposed by the task graph and the constraints imposed by the interprocessor communications 
network. Optimal thread management is an NP-hard problem, even given full a priori knowledge of the 
entire task graph and assuming a highly simplified architecture abstraction. Thread management is even 
more difficult for dynamic data-dependent computations which must use online algorithms because their task 
graphs are not known a priori. This thesis investigates online thread-management algorithms and presents 
XTM, an online thread-management system for large-scale distributed-memory multiprocessors. XTM has 
been implemented for the MIT Alewife Multiprocessor. Simulation results indicate that XTM’s behavior 
is robust, even when run on very large machines. 
XTM makes the thread-management problem more tractable by splitting it into three sub-problems: 

1. determining what information is needed for good thread management, and how to efficiently collect
and disseminate that information in a distributed environment,

2. determining how to use that information to match runnable threads with idle processors, and

3. determining what interprocessor communication style XTM should use.

XTM solves these sub-problems as follows:


1. Global information is collected and disseminated using an X-Tree data structure embedded in the 
communications network. Each node in the X-Tree contains a “presence bit,” the value of which 
indicates whether there are any runnable threads in the sub-tree headed by that node. On a machine 
with a sufficiently high, balanced workload, the expected cost of maintaining these presence bits is 
proved to be asymptotically constant, regardless of machine size. 


2. The presence bit information, along with a combining process aided by the X-Tree, is used to match 
threads to processors. This matching process is shown to be eight-competitive with an idealized 
adversary, for a two-dimensional mesh network. 

3. A message-passing communication style yields fundamental improvements in efficiency over a
shared-memory style. For the matching process, the advantage is shown to be a factor of $\log l$,
where $l$ is the distance between an idle processor and the nearest runnable thread.

Asymptotic analyses of XTM’s information distribution and thread distribution algorithms are given,
showing XTM to be competitive with idealized adversaries. While the solutions to the sub-problems given above
have provably good characteristics, it is difficult to say anything concrete about their behavior when
combined into one coherent system. Simulation results are therefore used to confirm the validity of the analyses,
with the Alewife Multiprocessor as the target machine.
Chapter 1


Introduction 


MIMD multiprocessors can attack a wide variety of difficult computational problems in 
an efficient and flexible manner. One programming model commonly supported by MIMD 
multiprocessors is the dynamic threads model. In this model, sequential threads of execution 
cooperate in solving the problem at hand. Thread models can be either static or dynamic. In 
the static threads model, the pattern of thread creation and termination is known before run- 
time. This makes it possible for decisions regarding placement and scheduling of the threads 
to be made in advance, either by the user or by a compiler. Unfortunately, many parallel 
applications do not have a structure that can be easily pre-analyzed. Some have running 
characteristics that are data-dependent; others are simply too complex to be amenable to 
compile-time analysis. This is where the dynamic aspect of the dynamic threads model 
enters the picture. If such programs are to be run in an efficient manner on large-scale 
multiprocessors, then efficient run-time thread placement and scheduling techniques are 
needed. This thesis examines the problems faced by an on-line thread-management system 
and presents XTM, an X-Tree-based [25, 6] Thread-Management system that attempts 
to overcome these problems. 

The general thread-management problem is NP-hard [7]. The standard problem has the 
following characteristics: precedence relations are considered, the communications network 
is flat and infinitely fast, tasks take differing amounts of time to finish, preemptive schedul- 
ing is not allowed, and the thread-management algorithm can be sequential and off-line. 


Even if precedence relations are ignored, the problem is still NP-hard when tasks vary in 
length and preemption is not allowed: it reduces to the bin-packing problem [7]. In the
real world, such simplifications often do not apply: real programs contain precedence rela- 
tions, communications networks are neither flat nor infinitely fast, and in order for thread 
management algorithms to be useful, they must be distributed and run in real time. 
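
To make the simplified off-line problem concrete, the following is a minimal sketch of a well-known greedy heuristic (longest-processing-time-first list scheduling) for independent, non-preemptive tasks of differing lengths. It is only an illustration of the problem setting described above; it is not part of XTM, and the function name `lpt_schedule` and its arguments are assumptions made for this example.

```python
import heapq


def lpt_schedule(task_lengths, num_processors):
    """Greedy longest-processing-time-first assignment of independent tasks.

    Illustrative heuristic only (not from the thesis): tasks have no
    precedence constraints, cannot be preempted, and communication is free.
    Returns (makespan, assignment), where assignment maps a processor id to
    the list of task lengths placed on it.
    """
    # Heap of (current load, processor id); the least-loaded processor is on top.
    loads = [(0, proc) for proc in range(num_processors)]
    heapq.heapify(loads)
    assignment = {proc: [] for proc in range(num_processors)}

    # Place the longest remaining task on the currently least-loaded processor.
    for length in sorted(task_lengths, reverse=True):
        load, proc = heapq.heappop(loads)
        assignment[proc].append(length)
        heapq.heappush(loads, (load + length, proc))

    makespan = max(load for load, _ in loads)
    return makespan, assignment


if __name__ == "__main__":
    makespan, assignment = lpt_schedule([7, 5, 4, 3, 3, 2], 2)
    print(makespan, assignment)
```

Such a heuristic is sequential and off-line and needs every task length in advance, which is exactly the kind of knowledge an on-line, distributed thread manager does not have.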

Since the overall problem is too difficult to tackle all at once, XTM breaks the thread-management
problem down into three sub-problems, attacking each one separately. The
sub-problems are identified as follows:

1. determining what global information is needed for good thread management and how
to efficiently collect and disseminate that information in a distributed environment,

2. determining how to use that information to match runnable threads with idle processors, and

3. determining what interprocessor communication style to use.

For each of these sub-problems, we present a solution and show through formal analysis that
the chosen solution has good behavior. Finally, we demonstrate, using high-level simulation
results, that the mechanisms work well together.


1.1 Principles for Algorithm Design in Distributed Environments


The optimal thread management problem is NP-hard, even when significantly simplified. 
The problem is further complicated by the requirement that it be solved in a distributed 
fashion. However, it is neither necessary nor practical to expect a multiprocessor thread 
manager to achieve an optimal schedule. The primary goal of such a system is to max- 
imize processor utilization, thus minimizing overall running time. This task is especially 
challenging because information about the state of the machine is generated locally at the 
processors, but information about the state of the entire machine is needed in order to make 
good thread-management decisions. Collection and distribution of this type of global state 
information is difficult in a distributed environment. Therefore, we set forth several general
principles to guide our design efforts:

Eliminate ALL Hot-Spots: At any given time, the number of processors accessing a
single data object and the number of messages being sent to a single processor should be
limited. Otherwise serialization may result, with a corresponding loss in efficiency.

Preserve Communication Locality: Threads should be run physically near to the data
they access, so as to minimize time spent waiting for transmissions across the communication
network.

Minimize System Overhead: Overhead imposed by the thread manager should be kept
to a minimum: time spent by a processor running system management code is time spent
not running the application being managed.


Given a design choice, the path that follows these principles more closely will be more
likely to attain good performance in a large-scale distributed system. Experience shows
that overall system performance will suffer if any piece of the thread-management system
should fail to follow any of these principles.


1.2 Contributions of This Thesis 


This thesis examines and develops thread-management algorithms for large-scale distrib- 
uted-memory multiprocessors. Guided by the design principles given above, we have de- 
veloped XTM, a thread management system that is sound from both a theoretical and a 
practical perspective. 


XTM solves the sub-problems identified above as follows: 


1. Global information is collected and disseminated using an X-Tree [25, 6] data structure 
embedded in the communications network (see Figures 4-1 and 4-2). Each node in the 
tree contains a “presence bit” whose value indicates whether there are any runnable 
threads in the sub-tree headed by that node. We show that on a machine with a 
sufficiently high, balanced workload, the expected cost of maintaining these presence
bits is asymptotically constant, regardless of machine size.

Presence-bit maintenance follows a simple OR rule at the nodes of the tree. The state of
a node’s presence bit changes only when the first of its children sets its presence bit
(zero to one), or when the last child with a set presence bit clears it (one to zero). In this
fashion, presence-bit update messages are naturally combined at the nodes of the tree. This combining
behavior has the effect of avoiding hot-spots that may otherwise appear at higher
nodes in the tree and reduces XTM’s bandwidth requirements. Furthermore, the
tree is embedded in the communications network in a locality-preserving fashion, thus
preserving communication locality inherent to the application. Finally, the operations
involved in maintaining global information are very simple, burdening the system with
very little overhead.


2. The presence bit information, along with a combining process aided by the X-Tree,
is used to match threads to processors. We show that this matching process can take
no more than eight times as much time as a particular idealized (unimplementable)
adversary, running on a two-dimensional mesh network.

Multiple requests for work from one area of the machine are combined at the X-Tree
nodes, allowing single requests for work to serve many requesters. In this manner, large
volumes of long-distance communication are avoided, and communication locality is
enhanced. Furthermore, if a single area of the machine contains a disproportionately
large amount of work, a few requests into that area are made to serve large numbers
of requesters, therefore avoiding hot-spot behavior in that area of the machine.

3. A message-passing communication style yields fundamental improvements in efficiency
over a shared-memory style. For the matching process, the advantage is a factor of
$\log l$, where $l$ is the distance between an idle processor and the nearest runnable thread.

The message-passing vs. shared-memory design choice boils down to an issue of
locality. In a message-passing system, the locus of control follows the data; in a
shared-memory system, the locus of control stays in one place. The locality gains
inherent to the message-passing model yield significant performance gains that appear
in both analytical and empirical results.
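
The presence-bit OR rule from item 1 above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not XTM's implementation: the X-Tree is reduced to a plain complete binary tree (its extra sibling links are ignored), the tree is stored in a single array rather than distributed across processors, and the class and method names (`PresenceTree`, `add_thread`, `remove_thread`) are hypothetical. Each method returns the number of child-to-parent update messages the change would generate, so the combining effect is visible.

```python
class PresenceTree:
    """Presence bits over a complete binary tree (heap layout, root = node 1).

    Simplified stand-in for the X-Tree: leaves correspond to processors and
    an internal node's bit is the OR of its two children's bits.
    """

    def __init__(self, num_leaves):
        # Assumption for this sketch: the number of processors is a power of two.
        assert num_leaves >= 1 and num_leaves & (num_leaves - 1) == 0
        self.num_leaves = num_leaves
        self.bit = [False] * (2 * num_leaves)
        self.threads = [0] * num_leaves  # runnable threads per processor

    def _notify_ancestors(self, node):
        """Propagate a changed bit upward; return the number of messages sent."""
        messages = 0
        while node > 1:
            parent = node // 2
            messages += 1  # the child tells its parent that its bit changed
            new_bit = self.bit[2 * parent] or self.bit[2 * parent + 1]
            if new_bit == self.bit[parent]:
                break  # the update is absorbed (combined) at this node
            self.bit[parent] = new_bit
            node = parent
        return messages

    def add_thread(self, proc):
        self.threads[proc] += 1
        if self.threads[proc] == 1:  # leaf bit goes from zero to one
            leaf = self.num_leaves + proc
            self.bit[leaf] = True
            return self._notify_ancestors(leaf)
        return 0

    def remove_thread(self, proc):
        assert self.threads[proc] > 0
        self.threads[proc] -= 1
        if self.threads[proc] == 0:  # leaf bit goes from one to zero
            leaf = self.num_leaves + proc
            self.bit[leaf] = False
            return self._notify_ancestors(leaf)
        return 0


if __name__ == "__main__":
    tree = PresenceTree(4)
    print(tree.add_thread(0))  # first work in the machine: travels to the root
    print(tree.add_thread(1))  # sibling already has work: absorbed one level up
```

In this sketch an update stops at the first ancestor whose bit does not change, which is the combining behavior that keeps traffic away from the upper levels of the tree when work is plentiful.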
Chapter 5 gives asymptotic analyses of XTM’s information-distribution and thread-distribution
algorithms that show XTM to be competitive with one idealized adversary.

In summary, XTM makes use of an X-Tree data structure to match idle processors with 
runnable threads. The X-Tree is used to guide efficient information distribution about where 
runnable threads can be found. The tree is also used to combine requests for work, so that a 
single work request can bring work to multiple idle processors. Finally, the mapping of the 
tree onto the physical processor array enhances communication locality of the application 
in a natural way, usually causing threads to run on processors near to where they were 
created. 

An implementation of XTM has been written for the MIT Alewife Multiprocessor [1, 2].
Alewife provides both an efficient implementation of the shared-memory abstraction and
efficient interprocessor messages. XTM employs message-passing as its primary communication
style. This message-passing implementation is made possible by the static mapping
of the X-Tree data structure onto the physical processor array. Use of messages not only
lowers the cost of primitive functions like thread creation, but it also improves communication
locality over a shared-memory implementation. These locality-related gains are
shown to become important as machine size increases. This thesis presents a detailed
description of XTM. It presents asymptotic analyses of XTM’s information-distribution and
thread-distribution algorithms, showing XTM to be competitive with idealized algorithms.
Simulation results bear out the theoretical analyses.

In the process of studying the behavior of XTM and other thread-management algorithms,
we have come to the following conclusions, using both analytical and empirical
arguments:

• As machines become large (> 256 processors), communication locality attains a position
of overriding importance. This has two consequences: First, a thread manager
is itself an application. If the structure of the application is understood well enough,
it can be implemented in a message-passing style, instead of using shared-memory.
A message-passing implementation can achieve significant performance gains by lowering
demands on the communication system. This is the case when the locus of
computation follows the data being manipulated, drastically reducing the cost of accessing
that data. Second, a good thread manager should attempt to keep threads
that communicate with one another close together on the machine.

• Thread management is a global optimization problem. A good thread manager must
therefore achieve efficient collection and distribution of relevant global information.

• Parallel algorithms for large machines must avoid hot-spot behavior, or else risk losing
the benefits of large-scale parallelism. Therefore, the algorithms presented in this
thesis are all fully distributed, employing combining techniques for the collection and
distribution of the threads being managed and of the global information needed to
achieve good thread management.


While the solutions to the sub-problems given above have provably good characteristics, 
it is difficult to say anything concrete about their behavior when combined into one coherent 
system. In order to study the behavior of different thread-management algorithms and to 
confirm the validity of the analyses, we developed a simulator for large-scale multiproces- 
sors. This simulator, called PISCES, models the Alewife architecture and produced nearly 
all of the data comparing different thread managers on various-sized multiprocessors. These 
simulation results confirmed the theoretical results: for large machines, the techniques em- 
ployed by XTM performed well. For example, a numerical integration application run on 
16384 processors and managed by XTM ran ten times faster than the same application 
managed by a diffusion-based thread manager, and three times faster than the same application
managed by a round-robin thread manager. In fact, the XTM run was within a
factor of three of a tight lower bound on the fastest possible running time.


1.3 Systems Framework and Assumptions


The systems framework under which this research was performed has the following characteristics:


1. Thread creation is fully dynamic: a new thread may be created on any processor at
any time.

2. Threads follow the Self-Scheduled Model [14], as depicted in Figure 1-1. This means
that the processors that execute application code also contend between themselves to
decide on which processor the various threads that make up the application are run.
There is no fundamental link between threads and processors: any thread can run
on any processor. The actual thread queue is implemented in a distributed fashion.
Every processor maintains its own local queue of threads (see Figure 1-2).

Figure 1-1: Self-Scheduled Model: Any thread can run on any processor. The processors
themselves determine which thread runs on which processor.


3. The scheduling policy is non-preemptive. A processor runs a thread until the thread
either terminates or blocks on some synchronization construct. When the thread
running on a processor blocks or terminates, the processor becomes idle: it needs
to find another runnable thread to execute. The behavior of an idle processor varies
significantly depending on the thread-management strategy. When a consumer-driven
search is in place, idle processors search the machine for work; when a producer-driven
search is being used, idle processors simply wait for more work to be delivered to them.

This non-preemptive policy places two requirements on applications being managed.
First, fairness among threads that constitute an application must not be an issue.
Second, deadlock issues are handled entirely by the application. We assume that
applications that need a contended resource will block on a synchronization construct
associated with the resource, thus taking themselves out of the runnable task pool.

Figure 1-2: Implementation of the Self-Scheduled Model: Every processor maintains a local
queue of threads.


4. Differences between individual threads are ignored. Threads can differ with respect
to such parameters as running time and communication behavior. For the purposes
of this thesis, we assume that all threads run for the same amount of time.

5. For the purposes of formal analysis, we assume that there is no inter-thread communication.
Furthermore, the shape of the application’s task graph is assumed to be
unknown. However, the actual implementation of XTM optimizes for the case where
the task graph is a tree, with the likelihood of inter-thread communication between
two threads diminishing with distance in the tree.
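
The framework above can be illustrated with a toy simulation. This is a rough sketch under stated assumptions, not a model of any thread manager studied in this thesis: every thread runs for one time step (assumption 4), there is no inter-thread communication (assumption 5), no new threads are created while the simulation runs, and an idle processor performs a consumer-driven search by scanning the other local queues in ring order at no cost. The function name `run_self_scheduled` and the ring search order are hypothetical.

```python
from collections import deque


def run_self_scheduled(initial_queues):
    """Simulate unit-time, non-preemptive threads on processors with local queues.

    initial_queues[i] lists the thread ids initially queued on processor i.
    Returns (steps, executed), where executed[i] lists the threads that
    processor i ran, in order.
    """
    queues = [deque(q) for q in initial_queues]
    num_processors = len(queues)
    executed = [[] for _ in range(num_processors)]
    steps = 0

    while any(queues):  # some local queue still holds a runnable thread
        for proc in range(num_processors):
            # Prefer local work (offset 0); otherwise search the other
            # processors' queues in ring order (a free, idealized search).
            for offset in range(num_processors):
                victim = (proc + offset) % num_processors
                if queues[victim]:
                    executed[proc].append(queues[victim].popleft())
                    break
        steps += 1  # every processor that found a thread runs it to completion

    return steps, executed


if __name__ == "__main__":
    print(run_self_scheduled([[0, 1, 2, 3, 4, 5], [], []]))
```

The sketch deliberately ignores the cost of the search itself; the rest of the thesis is concerned with exactly that cost on large machines.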


In order to analyze the performance of any algorithm on any multiprocessor, we need to
know the communication structure provided by the machine. In this thesis, we examine that
class of machines based on k-ary n-dimensional mesh communications networks connecting
$p$ processors, where $p = k^n$. Such an architecture is simple and can scale to arbitrarily
large sizes without encountering wire-packing or wire-length problems for n < 3 [5]. Similar
analyses can be performed for networks with richer communication structures. In our
analyses, we ignore the effects that network contention may have on performance.¹

¹In [10], it is shown that for a large class of machines, the effect of network contention on performance
is no worse than the effect of network latency, within a constant factor. This is true for all of the machines
and applications simulated for this thesis.
Parameter                            Description

$p$                                  Number of processors on a particular machine
$P_i$                                Labeling for processor number $i$

Parameters for a k-ary n-cube Mesh Network

$k$                                  Network Radix
$n$                                  Network Dimensionality
$P_{i_0, i_1, \ldots, i_{n-1}}$      Labeling for processor with mesh coordinates
                                     $i_0, i_1, \ldots, i_{n-1}$, each between $0$ and $k-1$


Table 1.1: Machine Parameter Notation.


1.4 Terminology and Notation 


Table 1.1 lists notation used throughout this thesis. The parameters in Table 1.1 describe
the particular machine under discussion, in terms of size, labeling, network parameters and
the ratio between network speed and processor speed. A multiprocessor of a given size
is referred to as a p-processor machine; the processors on that machine are labeled $P_i$,
where $i$ varies from $0$ to $p - 1$. Mesh-based multiprocessors are defined in terms of $n$, their
dimensionality, and $k$, their radix. On a k-ary n-cube mesh multiprocessor, $p = k^n$. Finally,
the ratio between processor speed and network speed is given as $t_n$, the number of processor
cycles it takes for one flit to travel one hop in the interconnection network.
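
As a small worked illustration of this notation, the sketch below converts a linear processor number into mesh coordinates and computes hop distances on a k-ary n-dimensional mesh (no wrap-around links). The helper names and the choice of treating $i_0$ as the least significant digit of the processor number are assumptions made for this example; the thesis does not fix a particular numbering here.

```python
def mesh_coords(i, k, n):
    """Coordinates (i_0, ..., i_{n-1}) of processor P_i on a k-ary n-dimensional mesh.

    Assumption for this sketch: i_0 is the least significant base-k digit of i.
    """
    if not 0 <= i < k ** n:
        raise ValueError("processor number out of range")
    coords = []
    for _ in range(n):
        coords.append(i % k)
        i //= k
    return tuple(coords)


def mesh_distance(a, b, k, n):
    """Number of hops between processors a and b (no wrap-around links)."""
    return sum(abs(x - y) for x, y in zip(mesh_coords(a, k, n), mesh_coords(b, k, n)))


def mesh_diameter(k, n):
    """Largest hop distance between any two processors."""
    return n * (k - 1)


def flit_latency_cycles(a, b, k, n, t_n):
    """Processor cycles for one flit to travel from a to b, ignoring contention."""
    return mesh_distance(a, b, k, n) * t_n


if __name__ == "__main__":
    k, n = 4, 2  # a 4-ary 2-dimensional mesh, so p = k**n = 16
    print(mesh_coords(14, k, n), mesh_distance(0, 15, k, n), mesh_diameter(k, n))
```

For the 16-processor example, opposite corners are six hops apart, which equals the diameter $n(k - 1)$.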


1.5 Outline 


The rest of this thesis is organized as follows. Chapter 2 presents other research in this 
area. Chapter 3 discusses high-level decisions that were made in the early stages of XTM’s 
design. In Chapter 4, we give a more detailed presentation of the information-distribution 
and thread-distribution algorithms at the heart of XTM. These algorithms are subject to a 
formal analysis in Chapter 5, especially with respect to asymptotic behavior. An empirical 
approach is taken in Chapters 6 and 7. Chapter 6 describes the experimentation method- 
ology used in examining the behavior of XTM’s algorithms and other interesting thread- 
distribution algorithms; Chapter 7 compares XTM with the other thread-management 
algorithms, as run on several dynamic applications. Finally, Chapter 8 presents conclusions
and suggestions for future research.
Chapter 2


Background 


Although the practical aspects of thread management and load sharing have been discussed 
in the literature for the past decade, thread management for large-scale multiprocessors has 
only been seriously examined during the latter half of that period. Kremien and Kramer [13]
state

...a flexible load sharing algorithm is required to be general, adaptable, stable,
scalable, transparent to the application, fault tolerant and induce minimum
overhead on the system...


This chapter explores the work of other investigators in this area and evaluates that work 
with respect to Kremien and Kramer’s requirements. 

Znati et al. [31] give a taxonomy of load sharing algorithms, a modified version of which
appears in Figure 2-1. Characteristics of generalized scheduling strategies are discussed
in [16], in which the scalability of various candidate load-sharing schemes is examined,
and [30], which looks at the effects of processor clustering.

In order to meet the criteria given in [13], a thread management system must be dynamic 
and fully distributed. In this thesis, we are especially interested in dynamic methods because 
we want to be able to solve problems whose structure is not known at compile time. The first 
widely-discussed dynamic Self-Scheduling schemes were not fully distributed; they employed
a central queue, which posed no problem on small machines.


Figure 2-1: Taxonomy of Load Sharing Methods.


2.1 Centralized Self-Scheduling


One of the first references to the Self-Scheduling model in the literature appeared in 1985
in [14]. In this paper, the authors described Self-Scheduling, although they never explicitly
named it such:

There are p processors, initially idle. At t = 0, they each take K subtasks from
a job queue, each experiencing a delay h in that access. They then continue
to run independently, taking batches of jobs and working them to completion,
until all the jobs are done.

Assuming that the subtask running times are independent identically distributed random
variables with mean $\mu$ and variance $\sigma^2$, and given a requirement that $K$ remain constant,
the authors derive an optimal value for $K$, as a function of $n$ (the total number of subtasks),
$h$, $p$, and $\sigma$. They go on to show that, when two conditions on these parameters hold,
the system efficiency¹ is bounded from below.

¹System efficiency, for a given problem run on a given number of processors, is defined to be the total
number of cycles of useful work performed divided by the total running time (in cycles) times the number
of processors the problem was run on.


In 1987, Polychronopoulos and Kuck introduced Guided Self-Scheduling [21]. Their 
scheme, which is oriented towards the efficient scheduling of parallel FORTRAN loops, 
varies $K$ with time. More specifically, whenever a processor becomes idle, it takes $\lceil r/p \rceil$ jobs
from the central queue, where r is the number of jobs in the queue at the time. Guided Self- 
Scheduling handles wide variations in thread run-times without introducing unacceptable 
levels of synchronization overhead. For certain types of loops, they show analytically that 
Guided Self-Scheduling uses minimal overhead and achieves optimal schedules. 
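
The difference between a constant batch size and the guided rule is easy to see from the sequence of batch sizes each produces. The following sketch simply enumerates those sequences; the function names are hypothetical, and queue-access delay and contention are not modeled.

```python
def fixed_chunks(num_jobs, batch_size):
    """Batch sizes handed out when every access takes a constant K = batch_size jobs."""
    chunks = []
    remaining = num_jobs
    while remaining > 0:
        take = min(batch_size, remaining)
        chunks.append(take)
        remaining -= take
    return chunks


def guided_chunks(num_jobs, num_processors):
    """Batch sizes under the guided rule: each access takes ceil(r / p) jobs,
    where r is the number of jobs left in the central queue."""
    chunks = []
    remaining = num_jobs
    while remaining > 0:
        take = -(-remaining // num_processors)  # integer ceiling of r / p
        chunks.append(take)
        remaining -= take
    return chunks


if __name__ == "__main__":
    print(fixed_chunks(100, 10))
    print(guided_chunks(100, 4))
```

With the guided rule, early accesses take large batches, which keeps the number of queue accesses low, while the final accesses take single jobs, which evens out the finishing times.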

All of the early Self-Scheduling work assumes that $h$ is both independent of $p$ and
unaffected by contention for the central job queue. These assumptions limit the
applicability of this work to small machines.


2.2 Fully Distributed On-Line Algorithms 


We define fully distributed algorithms to be algorithms that contain no single (or small
number of) serialization points on an arbitrarily large multiprocessor. Znati et al. [31]
divide such algorithms into three sub-categories: bidding methods, drafting methods and
hybrid methods. In bidding or producer-oriented methods, the creators of new work push
the work off onto processors with lighter loads. In drafting or consumer-oriented methods,
processors with light workloads locate and then steal excess work from processors with
heavier workloads. In hybrid methods, producers and consumers cooperate in the load-sharing
process. In the rest of this section, we give examples that have appeared in the
literature for each of these categories of load-sharing algorithms.


2.2.1 Bidding Methods 


Wu and Shu [28] present a Scatter Scheduling algorithm that is essentially the producer-oriented
dual of the round-robin drafting algorithm described in Chapter 6. When running
this algorithm, processor $i$ sends the first thread it creates to processor $i + 1$, the second
to processor $i + 2$, and so on. Assuming that the application is suitably partitioned, this
algorithm should get a relatively even load balance on machines of small and moderate size.
This expectation is borne out in the results presented in [28]. However, since producer-processors
send threads to essentially arbitrary destinations over time, any aspect of locality
concerning data shared between related threads is lost. Furthermore, the cost of thread
creation goes up with the diameter of the machine, so on large machines, one would expect
this algorithm to behave rather poorly.
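
The destination rule described above amounts to a one-line computation. The sketch below is an illustration only: wrapping around with a modulus at the end of the processor numbering, and allowing a processor to eventually send a thread to itself, are assumptions of this example rather than details taken from [28], and the function name is hypothetical.

```python
def scatter_destination(creator, creation_index, num_processors):
    """Destination of the creation_index-th thread (counting from 1) created on
    processor `creator`, under round-robin scatter scheduling.

    Assumption: destinations wrap around modulo the number of processors.
    """
    return (creator + creation_index) % num_processors


if __name__ == "__main__":
    # Destinations of the first eight threads created on processor 3 of 8.
    print([scatter_destination(3, j, 8) for j in range(1, 9)])
```

Because the destination depends only on a counter, successive threads created by the same processor land on unrelated parts of the machine, which is the loss of locality noted above.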


Znati et al. [31] present a bidding scheme that takes distance between sender and
receiver into account. When a new thread is created on a given processor, the processor
recomputes and broadcasts its load to the rest of the system. Every processor then compares
this new load with its own, taking the distance between itself and the source processor
into account. All processors that are eligible to receive the task based on the source
processor’s load, their own load and the distance between source and potential destination
then contend for the job, the winner being the closest processor with the lightest load.


There are two problems with this scheme, both related to scalability. First, every time 
a new task is created, a global broadcast takes place. The bandwidth requirement for 
such a scheme is not satisfiable on an arbitrarily large machine. Second, every processor 
has to participate in every scheduling decision. As the multiprocessor gets large and the 
corresponding number of threads needed to keep it busy gets large, the thread scheduling 
overhead required of each processor will become unacceptably large. In fairness to the
authors, one must realize that this work was done in the context of

...a loosely coupled large scale multiprocessing system with a number of processing
elements interconnected through a broadcast based communication subnet...

This statement implies that the machines this algorithm is intended for are of limited size
(despite the use of “large-scale”), and that the grain size of the tasks is also rather large,
minimizing the effect of scheduling overhead on overall performance.
2.2.2 Drafting Methods


Ni, Xu and Gendreau [20] present a drafting-style load-sharing algorithm that also seems 
to be oriented towards a relatively small number of independent computers connected to a 
single Local Area Network. In this scheme, each processor has an idea of the state of all 
other processors by maintaining a “load table,” which has an entry for every other processor 
in the system. Processors can be in one of three states, light load, normal load or heavy 
load. As in [31], every processor in the system is informed of all changes in load on every 
other processor by means of broadcasts. When a processor $P_i$ goes into the lightly loaded
state, it sends a request for work to every processor in the system that it thinks is in the
heavy load state. Each of those processors will respond with a “draft-age,” which is a
measure of how much the entire system would benefit from an exchange of work between
that processor and $P_i$. $P_i$ then determines which candidate will yield the highest benefit
and gets work from that processor.

The same objections that applied to the scheme proposed in [31] apply here: such an
algorithm is not scalable. Furthermore, the “draft-age” parameter used to compare
drafting candidates does not take the distance between the two processors into account. This
lack of attention to communication distances further limits this algorithm’s effectiveness on
large multiprocessors.


2.2.3 Hybrid Methods


Diffusion


Halstead and Ward proposed diffusion scheduling [9] as a means of propagating threads
throughout the machine. Their description is a rather brief part of a larger picture and
gives no specific details. However, diffusion of load between neighboring processors behaves
like a Jacobi Relaxation, and the settling time for Jacobi Relaxation is proportional
to the square of the diameter of the mesh upon which the relaxation takes place [3]. More
sophisticated relaxation techniques, such as Multigrid methods [3], achieve much better
convergence times at the expense of the locality achieved by simple diffusion methods.
The XTM algorithm presented by this thesis can be thought of as the multigrid version
of diffusion scheduling. We implemented both the simple diffusion scheduling algorithm
described here and XTM. Results are given in Chapter 7.
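
To illustrate why simple diffusion settles slowly on large machines, the sketch below relaxes a load distribution on a one-dimensional mesh, Jacobi style. Since [9] gives no specific details, everything here is an assumption made for illustration: loads are treated as real numbers rather than whole threads, work flows across each link in proportion to the load difference, and the coefficient `alpha`, the tolerance, and the function names are hypothetical. This is not the diffusion scheduler evaluated in Chapter 7.

```python
def diffusion_step(loads, alpha=0.25):
    """One Jacobi-style sweep on a 1-D mesh.

    Work flows across each link in proportion to the difference between the
    loads at its two endpoints; only the old loads are read during the sweep.
    """
    new_loads = list(loads)
    for left in range(len(loads) - 1):
        flow = alpha * (loads[left] - loads[left + 1])
        new_loads[left] -= flow
        new_loads[left + 1] += flow
    return new_loads


def settling_steps(num_processors, tolerance=1e-3, max_steps=1_000_000):
    """Sweeps needed until the loads differ by at most `tolerance`, starting
    with all of the work (an average of one unit per processor) on processor 0."""
    loads = [float(num_processors)] + [0.0] * (num_processors - 1)
    for step in range(max_steps):
        if max(loads) - min(loads) <= tolerance:
            return step
        loads = diffusion_step(loads)
    raise RuntimeError("loads did not settle within max_steps")


if __name__ == "__main__":
    for size in (4, 8, 16):
        print(size, settling_steps(size))
```

Running the sketch for increasing mesh sizes shows the number of sweeps growing much faster than the mesh diameter, in line with the quadratic settling time cited above.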


Gradient Model 


The Gradient Model proposed in [19] has a communication structure similar to that of 
diffusion scheduling. Processors are defined to be in one of three states: lightly loaded, 
moderately loaded or heavily loaded. Instead of moving threads based on the difference 
in the workload on neighboring processors, this scheme builds a gradient surface derived 
from estimates of the distance to the nearest lightly loaded processor. The gradient surface 
is defined to have a value of zero on lightly loaded processors; on every other processor, 
the value of the surface is defined to be one more than the minimum of its neighbors. 
The resulting surface gives an indication of the distance from and direction towards the 
nearest lightly loaded processor. The gradient surface is built by propagating information 
between nearest neighbors. A heavily loaded processor acts as a job source, sending jobs 
down the gradient in any “downhill” direction. A lightly loaded processor acts as a job 
sink, accepting jobs flowing towards it. A moderately loaded processor accepts jobs as if 
it were lightly loaded, but acts like a heavily loaded processor with respect to building the 
gradient. Finally, when there are no lightly loaded processors in the system, the gradient 
surface eventually flattens out at a maximum value equal to the machine diameter plus one. 
A machine in this state is said to be saturated. 
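
The gradient surface defined above can be sketched as a fixed-point relaxation. The following Python fragment is an illustration only, not the implementation from [19]: the function name `gradient_surface`, the two-dimensional mesh without wraparound links, and the sequential sweep order are all assumptions made for the example (the scheme itself propagates values between nearest neighbors as a periodic process on every processor).

```python
def gradient_surface(lightly_loaded, rows, cols):
    """Return the gradient surface of a rows x cols mesh as a list of lists.

    lightly_loaded is a set of (row, col) positions whose surface value is
    pinned to zero. Every other position relaxes to one more than the
    minimum of its neighbors, capped at the machine diameter plus one.
    """
    cap = (rows - 1) + (cols - 1) + 1  # machine diameter plus one
    g = [[0 if (r, c) in lightly_loaded else cap for c in range(cols)]
         for r in range(rows)]
    changed = True
    while changed:  # sweep until no value changes (assumed sequential order)
        changed = False
        for r in range(rows):
            for c in range(cols):
                if (r, c) in lightly_loaded:
                    continue
                nbrs = [g[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < rows and 0 <= cc < cols]
                new = min(cap, 1 + min(nbrs)) if nbrs else cap
                if new != g[r][c]:
                    g[r][c] = new
                    changed = True
    return g
```

When `lightly_loaded` is empty, every entry stays at the cap, which corresponds to the saturated state described in the text. A heavily loaded processor would send jobs to any neighbor whose value is lower than its own.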

The Gradient Model appears to scale well, independent of machine topology. However, 
there are some questions regarding its performance. The first question concerns the behavior 
of a saturated machine. In such a case, when a single processor becomes lightly loaded, a 
wave propagates out from that processor throughout the entire machine. In this manner, 
a number of tasks proportional to the square of the diameter of the machine will move 
between processors in response to a small local perturbation in the state of the machine. 
Second, gradient construction and job propagation take place as a periodic process on
every processor regardless of the state of the machine. This imposes a constant overhead 
on all processors independent of how well the processing load is spread about the machine: 
a price is paid for load-sharing whether needed or not. Finally, there is some question as
to the stability of this scheme. It is easy to see that it tends to move jobs from heavily
loaded processors towards lightly loaded processors; however, it seems possible to construct
situations in which the time lag imposed by the propagation of gradient information could
cause jobs to pile up at a single location, leading to a poor load balance at some time in
the future.


Random Scheduling 


Rudolph et al. [24] propose a load balancing scheme in which processors periodically
balance their workloads with randomly selected partners. The frequency of such a load
balancing operation is inversely proportional to the length of a processor’s queue, so that
heavily loaded machines (which have little need for load balancing) spend less time on load
balancing, leaving more time for useful computation. A probabilistic performance analysis
of this scheme shows that it tends to yield a good balance: all task queues are highly likely 
to be within a small constant factor of the average task queue length. 
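
A toy simulation can make the pairwise balancing idea concrete. The sketch below is not the algorithm of [24]; the name `balance_round`, the rule that a processor balances with probability 1 / max(1, queue length), and the even split between partners are assumptions chosen only to mirror the description above. Communication cost and settling time, the two concerns raised next, are deliberately not modeled.

```python
import random


def balance_round(queues, rng):
    """Give every processor one chance to balance with a random partner.

    queues is a list of queue lengths and is modified in place. A processor
    with queue length q balances with probability 1 / max(1, q), so heavily
    loaded processors balance less often (an assumed rule).
    """
    n = len(queues)
    for i in range(n):
        if rng.random() < 1.0 / max(1, queues[i]):
            j = rng.randrange(n - 1)  # pick a partner other than i
            if j >= i:
                j += 1
            total = queues[i] + queues[j]
            queues[i] = total // 2            # the two partners split their
            queues[j] = total - queues[i]     # combined work evenly
```

Each pairwise split preserves the total amount of work and never raises the largest queue length, so the spread between the longest and shortest queues cannot grow from round to round.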

Unfortunately, this work assumes that the time to perform a load balancing operation 
between two processors is independent of machine size. Clearly, in a large machine, it is 
more costly to balance between processors that are distant from each other than between 
processors that are close to each other. Since this scheme picks processors at random, the 
cost of a load balancing operation should rise for larger machines. Also, nothing is said in 
this paper about settling times or rates of convergence. The claim that a load-balancing
scheme yields a balanced system is not worth much if the time it takes to achieve that
balance is longer than the time a typical application is expected to run.


We have briefly surveyed a number of thread management algorithms given in the
literature. For each algorithm, we have listed one or more potential problems that may be
encountered when implementing the algorithm on a large-scale multiprocessor. In the rest
of this thesis, we describe and evaluate a new thread management scheme that attempts to
overcome all of these objections and tries to meet the requirements set forth by Kremien
and Kramer [13].


Chapter 3


High-Level System Design 


In this chapter, we present several high-level decisions that we made early on in the XTM 
design effort. These decisions were based mainly on the design principles given in Chapter 1: 
eliminate hot-spots, preserve communication locality and minimize system overhead. The 
details of the algorithms used by XTM will be given in Chapter 4. The goal of this chapter 
is to give the intuition behind the design of those algorithms. 

As stated in Chapter 1, we break the thread management problem down into three
sub-problems:

1. determining what global information is needed for good thread management and how
to efficiently collect and disseminate that information in a distributed environment,

2. determining how to use that information to match runnable threads with idle
processors, and

3. determining what interprocessor communication style to use.

Stated briefly, the high-level solutions to each of those sub-problems are:


1. Global information is collected and disseminated using an X-Tree [25, 6] data structure
embedded in the communications network. Each node in the tree contains a “presence
bit” whose value indicates whether there are any runnable threads in the sub-tree
headed by that node.

2. The presence bit information, along with a combining process aided by the X-Tree, is
used to match threads to processors.

3. A message-passing communication style yields fundamental improvements in efficiency
over a shared-memory style.


In this chapter, we discuss each of these solutions in detail. 


3.1 Global Information Management 


In order to make good scheduling decisions, information about the state of the entire
machine is needed, but this information is generated locally at the processors. A tree is a
scalable data structure that can be used to efficiently collect and disseminate global
information while avoiding hot-spots through combining techniques. Therefore, we use a
tree-based data structure to aid in the efficient distribution of information about the state
of the machine.


A tree, while good for efficiently collecting and distributing data, can create artificial 
boundaries where none actually exist. Processors that are physically near one another 
can be topologically distant from one another, depending on their relative positions in the 
tree, even when the tree is laid out so as to preserve as much of the locality afforded by 
the communications network as possible. This loss of locality can be alleviated by adding 
connections in the tree between nodes that are physically near each other. Such a tree, 
called an X-Tree, is the basic data structure upon which XTM’s algorithms are based. 
Despain et al. [25, 6] first introduced the X-Tree data structure as a communications
network topology. The particular variant of X-Tree we use is a full-ring X-Tree without
end-around connections.


The rest of this section discusses the need for global information in solving the
dynamic thread management problem. We then suggest combining trees as a mechanism for
efficiently collecting and disseminating such global information.


3.1.1 The Need for Global Information


As discussed in Chapter 1, thread management is essentially a global optimization problem. 
Even if we forego optimality, it is easy to see how global knowledge is necessary in order 
to make good local decisions. For example, in a system where the producers of threads are 
responsible for deciding where they should be run, the choice of where to send a new thread 
is strongly influenced by the overall state of the machine. If the machine is relatively “full,” 
then we want to keep a newly created thread near its creator to enhance locality, but if the 
machine is relatively “empty,” we may need to send the thread to some distant processor 
to improve load-sharing. Similarly, in a system where idle processors are responsible for 
“drafting” work, the overall state of the machine is just as important. If there is no work to 
be found on nearby processors, a searcher has to know what regions of the machine contain 
threads available to be run. 

As shown in Chapter 7, diffusion methods and round-robin methods perform relatively 
poorly on large multiprocessors. Such thread-management algorithms share the attribute 
that they use no knowledge about the overall state of the machine. This seems to reduce the 
effectiveness of such methods, since some knowledge of overall machine state is necessary 
to achieve good load-sharing. 

Management of global knowledge can be prohibitively expensive on large-scale machines. 
The minimum latency for a global broadcast is proportional to the machine diameter.
Furthermore, such a broadcast places a load on the communications network at least
proportional to the number of processors. If every processor continually produces information
that changes the global state, it is clearly unacceptable for each processor to broadcast that
information to all other processors every time a local state change occurs. Similarly, if
global information is concentrated on one node, prohibitive hot-spot problems can result.
If every processor has to inform that node in the event of a change, and if every processor
has to query that node to find out about changes, then the network traffic near that node
and the load on the node itself become overwhelming, even for relatively small machines.

So the key question is the following: if global information is necessary in order to perform
effective thread management, how can we manage that information in such a way as to not
put an unacceptable load on any processor or any part of the communications network?


3.1.2 Prospective Solutions


As discussed in Section 1.3, we implement the global Self-Scheduled model [14, 21] by main- 
taining a queue of runnable threads on every processor. Therefore, the global information 
useful to a thread-management algorithm could potentially include the state of every one 
of these queues. However, the cost of keeping every processor informed of the state of 
every other processor’s thread queue would quickly become unmanageable, even for rel- 
atively small machines, irrespective of communications architecture. Some way to distill 
this global information is needed, such that the cost of collecting and disseminating the 
distilled information is acceptable, while keeping around enough information to make good 
thread-management decisions. 

Software combining [29] presents itself as the obvious way to keep information collection 
and dissemination costs manageable. When combining techniques are employed, the load 
on the communications network can be held to an acceptable level. Furthermore, if one is 
careful, certain combining strategies can guarantee an acceptably low load on all processors, 
even ones that contain nodes high up in the combining tree. Consequently, we require that 
the global information used by the thread manager must be organized in such a manner 
that combining techniques apply: any operation used to distill two separate pieces of global 
information into one piece must be associative. 

The most straightforward way to distill the state of an individual processor queue into 
one piece of information is to take the length of that queue. A simple sum can then be 
used to combine two such pieces of data (see Figure 3-1). Each node in such a combining 
tree keeps track of the sum of the queue lengths for the processors at the leaves of the 
subtree headed by that node. Unfortunately, it quickly becomes apparent that maintaining 
an exact sum is both expensive and impossible: impossible because of the communication 
delays along child-parent links in the tree, and expensive because in such a scheme, nodes 
near the root of the tree are constantly updating their own state, leaving no time for useful 
work (execution of threads). 
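
The combining requirement can be illustrated with a small sketch. It assumes a binary tree over a row of processors and a hypothetical helper named `combine_levels`; it ignores communication delays and the distribution of tree nodes among processors, which are exactly the issues that make exact sums impractical on a real machine.

```python
from operator import add, or_


def combine_levels(leaves, op):
    """Return the levels of a binary combining tree, leaves first, root last.

    op must be associative, so that the value at a node depends only on the
    leaves below it and not on how the partial results were grouped.
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = [op(prev[i], prev[i + 1]) if i + 1 < len(prev) else prev[i]
               for i in range(0, len(prev), 2)]
        levels.append(nxt)
    return levels


queue_lengths = [3, 0, 0, 1, 0, 0, 0, 2]

# Exact weights: each node holds the sum of the queue lengths below it.
sums = combine_levels(queue_lengths, add)

# A one-bit summary: is any queue below this node non-empty?
non_empty = combine_levels([q > 0 for q in queue_lengths], or_)
```

Both addition and logical OR are associative, so either can serve as the distilling operation; they differ in how much information survives and in how often a change at a leaf must be propagated upward.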

Since maintenance of exact weights in the tree is impossible anyhow, perhaps
approximate weights could keep costs acceptably low, while still providing sufficient information for
the thread manager to function acceptably. In Chapters 5 and 7, we explore two such
approximation techniques. The first of these techniques maintains weights at each node of the
combining tree as before, but instead of informing its parent whenever its weight changes,
a node only informs its parent of a weight change that crosses one of a well-chosen set of
boundaries. Details of this technique for disseminating global information are discussed
more thoroughly in Section 6.4, under the heading XTM-C.

Figure 3-1: Combining Tree with Exact Queue Lengths: The tree is implemented as a data
structure distributed among the various processors in the system. Each node in the tree keeps track
of the total number of runnable threads on queues on the processors at the leaves of the subtree
headed by that node. The rectangular boxes labeled $P_i$ represent processors in a one-dimensional
mesh. Each processor holds a queue of threads; in some cases, that queue is empty. The circles
represent nodes in the combining tree. Node $N^{l}_{i}$ represents a node at level $l$ in the tree residing on
processor $P_i$. Details of the mapping of tree nodes onto processors are given in Section 4.1.2.

Results in Chapter 7 show that an even simpler approximation technique gives
better practical results. Figure 3-2 illustrates this simpler approximation, which reduces the
“weight” maintained at each node to a single “presence” bit. A node’s presence bit is turned
on when any of the processors at the leaves of the subtree headed by that node has at least
one runnable thread on its queue.

Figure 3-2: Combining Tree with Presence Bits: The tree is distributed among the processors
as described in Figure 3-1. Each node in the tree maintains a “presence bit,” which is turned on
whenever any queue on any processor at any leaf on the subtree headed by that node contains at
least one runnable thread.


3.2 Matching Threads with Processors 


In many dynamic computations, it is unavoidable that situations arise in which one area 
of the machine is rich with work, with most or all processors busy, while another area is 
starved for work, with most or all processors idle. In such situations, the communications 
network can easily be overloaded as work is transferred from the rich area to the sparse 
area. Furthermore, a tree-based algorithm is prone to hot-spot behavior at higher-level 
nodes. Both of these problems can be avoided if combining is employed: a single request 
can be made to serve multiple clients. 

Combining techniques are useful not only for collecting and disseminating information;
they can also be used to collect and disseminate the threads themselves. As an example of
why combining is essential, consider the case where one section of a large machine is very
busy, containing an excess of runnable threads. At the same time, some other section of
the same machine is nearly idle, with very few runnable threads. If combining is not used,
then drafting-style thread managers would have each processor in the idle section requesting
work from the busy section, while producer-driven managers would have each processor in
the busy section sending work over to the idle section. In either case, there would be many
messages being sent over long distances.


We propose the use of combining to cut down on the number of long-distance messages.
This combining should cause a single message to go from an idle section to a busy section,
where all threads to be sent over are gathered, sent back in a single chunk, and then
distributed among the idle processors.


3.2.1 X-Tree-Based Design 


The need for combining suggests the development of algorithms based on trees: a tree is 
an ideal structure around which to build algorithms that employ combining. Furthermore, 
trees are easy to embed in most known architectures in a natural, locality-preserving fashion. 
However, most communications architectures provide a richer communications structure
than that of a tree. Therefore, if simple tree-based designs are used, there is a potential
for a loss of locality: the tree can introduce topological boundaries where no physical
boundaries exist. Locality lost in this manner can be regained by adding connections in the
tree between nodes that are physically near each other. A tree enhanced with such links,
introduced in [25] and [6] as a full-ring X-Tree without end-around connections, is the basic
data structure upon which XTM’s algorithms are based (see Figures 4-1 and 4-2).


In Chapter 5, we show that the nearest-neighbor links are needed to get good theoretical
behavior. However, results in Chapter 7 demonstrate that for most applications, the simple
tree-based design attains significantly higher performance than the X-Tree-based design.
For systems in which the ratio of processor speed to network speed is higher, the X-Tree’s
performance surpasses the simple tree’s performance.


3.3 Shared-Memory vs. Message-Passing


We use message-passing as the primary means of interprocessor communication. This re- 
duces XTM’s communication requirements, thus increasing performance over a shared- 
memory implementation. In Chapter 5 these performance gains are shown to become sig- 
nificant on large machines. 

One of the earliest design decisions concerned programming style: should programs 
employ shared-memory or message-passing? This may seem odd, since any algorithm im- 
plemented in one of the two styles can be implemented in the other. However, there are 
compelling theoretical and practical arguments in favor of the message-passing style. This 
research was performed with the Alewife machine as the primary target machine. Since 
Alewife supports both efficient shared-memory and message-passing, we had the luxury of 
choosing between the two styles. 

To start with, we give informal definitions of the terms “shared-memory” and “message- 
passing.” Briefly, the central difference between the two mechanisms is whether or not each 
communication transaction is acknowledged individually. In a message-passing environ- 
ment, no acknowledgment is required for individual messages; in a sequentially consistent 
shared-memory environment, every communication transaction requires either a reply con- 
taining data or an acknowledgment that informs the requesting processor that the requested 
transaction is complete. 

There are two kinds of nodes in a shared-memory environment: processing nodes and memory
nodes. Throughout this thesis, when we say shared-memory, we really mean sequentially
consistent shared-memory [17]. Communication takes place in the form of a request from
a processor to a memory node, requiring an acknowledgment when the request has been
satisfied. The actual low-level particulars of such a transaction depend on such
machine-specific details as the existence of caches and the coherence model employed. Requests come
in the form of reads, which require a reply containing the data being read; writes, which
modify the data and require an acknowledgment, depending on the machine details; and
read-modify-writes, which modify the data and require a reply containing the data.

In a message-passing environment, there are only processing nodes. Data is kept in
memories local to the processors, and one processor’s data is not directly accessible by
another processor. Communication takes place in the form of messages between the processors,
which require no acknowledgments.

Message-passing systems have the potential to achieve significantly more efficient use of 
the interprocessor communications network than do shared-memory systems. In many cases, 
the message overhead required to support the shared-memory abstraction is not needed for 
correct program behavior. In those cases, a message-passing system can achieve significantly 
better performance than a shared-memory system [11]. When, however, a large fraction 
of interprocessor communication is in the read, write, or read-modify-write form directly 
supported by a shared-memory system, then using a shared-memory system may yield 
better performance because shared-memory systems are usually optimized to support such 
accesses very efficiently. When the distance messages travel in the communications network 
is taken into account, the advantage held by a message-passing system over a shared-memory 
system can be significantly greater. For the algorithms employed by XTM, the gain can 
be as high as a factor of $\log p$, where $p$ is the number of processors in the system (see
Chapter 5). The advantages of message-passing are not limited to asymptotics. Kranz and 
Johnson [11] show that the cost of certain primitive operations such as thread enqueueing 
and dequeueing can improve by factors of five or more when message-passing is used. 

If message-passing is so superior to shared-memory, why use shared-memory at all? This 
question is also addressed in [11]. There are two answers to this question. First, certain 
types of algorithms are actually more efficient when run on shared-memory systems. Second, 
it seems that the shared-memory paradigm is easier for programmers to handle, in the same 
way that virtual-memory systems are easier for programmers than overlay systems. Even 
for the relatively simple algorithms employed by XTM, implementation was significantly 
easier in the shared-memory style. However, the performance gains afforded by
message-passing outweighed the additional complexity of implementation, and message-passing was
the ultimate choice.


In this chapter, we have presented and argued for a number of early high-level design
decisions. In the next chapter, we show how we assembled these decisions to produce a
detailed design of a high-performance thread-management system.


Chapter 4


X-Tree-Based Thread Management


This thesis presents a thread distribution system based on an X-Tree-driven search. An 
X-Tree [25, 6] is a tree augmented with links between same-level nodes. The particular 
variant we use contains near-neighbor links between touching same-level nodes. In the 
parlance used in [25] and [6], this is a full-ring X-Tree without end-around connections (see 
Figures 4-1 and 4-2). In this chapter, we describe in detail the algorithms that go into 
XTM, an X-Tree-based Thread Manager. 

When a new thread is created, its existence is made public by means of presence bits 
in the X-Tree. When a node’s presence bit is set, that means that there are one or more 
runnable threads somewhere in the sub-tree rooted at that node; a cleared presence bit 
indicates a sub-tree with no available work. Presence information is added to the X-Tree 
as follows: when a thread is either created or enabled (unblocked), it is added to the 
runnable thread queue associated with some processor. If the queue was previously empty, 
the presence bit in the leaf node associated with that processor is set. The presence bits 
in each of the node’s ancestors are then set in a recursive fashion. The process continues
up the X-Tree until it reaches a node whose presence bit is already set. In Chapter 5,
we show that although this algorithm can cost as much as O (nk),¹ the expected cost is
actually O (4) when the thread queue lengths are independent, identically
distributed random variables with probabilities of non-zero length of at least $\lambda$, for some
$\lambda$: $\mathrm{Prob}(\mathit{qlen} > 0) \ge \lambda$. Whenever a thread is taken off of a queue, if that causes the queue to
become empty, a similar update process is performed, with similar costs.

¹ $n$ is the mesh dimensionality; $k$ is the mesh radix.

Figure 4-1: Binary X-Tree on a One-Dimensional Mesh Network: Labels of the form $P_i$ indicate
physical processors. Circles indicate nodes in the X-Tree. Thick lines indicate parent-child links.
Thin lines indicate near-neighbor links.
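
The presence-bit update described above can be sketched as follows. This is a minimal model under stated assumptions, not XTM's implementation: the class name `PresenceTree` is hypothetical, the tree is a plain binary tree over a row of processors held in one address space (no embedding, no messages, no near-neighbor links), and the rule used when clearing bits (stop as soon as a sibling's bit is still set) is an assumption, since the text only says that a similar update process is performed.

```python
class PresenceTree:
    """Presence bits over a binary tree with 2**levels leaf processors."""

    def __init__(self, levels):
        self.levels = levels
        self.queues = [0] * (1 << levels)  # runnable threads per processor
        # bits[l][k] is the presence bit of the k-th node at level l.
        self.bits = [[False] * (1 << (levels - l)) for l in range(levels + 1)]

    def enqueue(self, p):
        """Add a thread on processor p; return the number of bits set."""
        self.queues[p] += 1
        steps = 0
        if self.queues[p] == 1:  # the queue was previously empty
            for l in range(self.levels + 1):
                if self.bits[l][p >> l]:
                    break  # an already-set ancestor ends the climb
                self.bits[l][p >> l] = True
                steps += 1
        return steps

    def dequeue(self, p):
        """Remove a thread from processor p; return the number of bits cleared."""
        assert self.queues[p] > 0
        self.queues[p] -= 1
        steps = 0
        if self.queues[p] == 0:  # the queue just became empty
            self.bits[0][p] = False
            steps = 1
            for l in range(1, self.levels + 1):
                k = p >> l
                if self.bits[l - 1][2 * k] or self.bits[l - 1][2 * k + 1]:
                    break  # some child still has work (assumed stopping rule)
                self.bits[l][k] = False
                steps += 1
        return steps
```

The returned step counts show why the expected cost can be much lower than the worst case: the first thread on an otherwise empty machine updates every level, while later threads created near existing work stop after one or two levels.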


When a processor becomes idle, it initiates a thread search. A thread search recursively 
climbs the X-Tree, examining successively larger neighborhoods in the process. When a 
given node is examined, it and all of its neighbors are queried as to the status of their 
presence bits. If none of the bits is set, the search continues with the node’s parent. 
Otherwise, for one of the nodes whose presence bits are set, the searcher requests half of the 
work available on the sub-tree rooted at that node. Such a request is satisfied by recursively 
requesting work from each child whose presence bit is set. In Chapter 5, we show that this 
search algorithm is guaranteed to finish in O (nd) time, where n is the dimensionality of 
the mesh and d is the distance between the searcher and the nearest non-empty queue. 
This is O (n)-competitive with an optimal drafting-style thread manager. Furthermore, in
order to limit long-distance accesses, the search algorithm combines requests from multiple
children of each node: for a given sub-tree, only a single representative searches outside
that sub-tree at a time. When the representative comes back with a collection of runnable
threads, the threads are divided among the other searchers that were blocked awaiting the
representative’s return.

Figure 4-2: Quad-X-Tree on a Two-Dimensional Mesh Network: Labels of the form $P_{i,j}$ indicate
physical processors. Circles indicate nodes in the X-Tree. Thick lines indicate parent-child links.
Thin solid lines indicate near-neighbor links to edge-adjacent neighbors. Thin dashed lines indicate
near-neighbor links to corner-adjacent neighbors.
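
The search described above (climb the tree, query each node and its same-level neighbors, then take half of the work found) can be sketched in one dimension. Everything in this fragment is an assumption made for illustration: the function names `present`, `take_half`, and `search`, the recomputation of presence bits from the queues instead of storing them, the rounding rule at the leaves, and the omission of request combining and of the actual transfer of threads to the searcher. The number of processors is assumed to be a power of two.

```python
def present(queues, level, k):
    """Presence bit of the k-th node at a level: any work in its subtree?"""
    lo = k << level
    return any(q > 0 for q in queues[lo:lo + (1 << level)])


def take_half(queues, level, k):
    """Remove and return about half of the threads below a node.

    The request is passed to each child whose presence bit is set; a leaf
    gives up half of its queue, rounded up (an assumed rounding rule).
    """
    if level == 0:
        taken = (queues[k] + 1) // 2
        queues[k] -= taken
        return taken
    return sum(take_half(queues, level - 1, c)
               for c in (2 * k, 2 * k + 1)
               if present(queues, level - 1, c))


def search(queues, p):
    """Search on behalf of idle processor p.

    Returns (level, node index, threads taken), or None if no work exists.
    """
    height = (len(queues) - 1).bit_length()
    for level in range(height + 1):
        k = p >> level
        width = len(queues) >> level
        for cand in (k, k - 1, k + 1):  # the node, then its near neighbors
            if 0 <= cand < width and present(queues, level, cand):
                return level, cand, take_half(queues, level, cand)
    return None
```

With work on processor 4 only, a search from processor 3 succeeds at level 0 through a near-neighbor link, even though the two leaves lie in different halves of the tree; without that link the search would have to climb to the root.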

XTM is based on an X-Tree data structure embedded into the communications network. 
The nodes of the X-Tree contain presence bits whose values are updated whenever a thread 
is either created or consumed. The presence bits are used to drive a search process that gets 
runnable threads to idle processors. Section 4.1 describes the details of the X-Tree, including 
its embedding in the network. Section 4.2 gives the details of the presence bit update
algorithm. Section 4.3 describes the process that sends runnable threads to idle processors.
Finally, Section 4.4 gives an example of these algorithms in action. All algorithms described
in this chapter are assumed to employ the message-passing communication style.


4.1 X-Trees 


One- and two-dimensional X-Trees are pictured in Figures 4-1 and 4-2. For the algorithm 
presented in this thesis, an X-Tree is a software data structure. The nodes that make up the 
X-Tree are distributed around the communications network on the processors. The X-Tree 
nodes include one leaf node on each processor and enough higher-level nodes to make up 
the rest of the tree. The higher-level nodes are distributed throughout the machine in such 
a way as to keep nodes that are topologically near to one another in the tree physically 
near to one another in the network (see Section 4.1.2). The X-Tree guides the process of 
matching idle processors with runnable threads, using near-neighbor links between nodes 
that are physically near to each other. On an n-dimensional mesh network, each node in
the X-Tree has up to $3^n - 1$ near-neighbor links. Each leaf of the X-Tree is associated with
a physical processor, and is stored in the memory local to that processor.
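
The count of near-neighbor links follows from enumerating the directions in which a same-level node can touch another. A short sketch, with a hypothetical helper name, assuming that a near neighbor differs by at most one step in each mesh dimension:

```python
from itertools import product


def near_neighbor_offsets(n):
    """All nonzero offset vectors in {-1, 0, 1}^n.

    Each offset is one direction in which an n-dimensional node can have an
    edge-adjacent or corner-adjacent same-level neighbor.
    """
    return [d for d in product((-1, 0, 1), repeat=n) if any(d)]
```

There are $3^n$ offset vectors in total and one of them is the zero vector, which leaves $3^n - 1$ directions: 2 in one dimension and 8 in two. The bound is "up to" because nodes on the boundary of the mesh have fewer neighbors when there are no end-around connections.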


4.1.1 Notation 


The X-Tree data structure is embedded in the communications network. The individual 
nodes that make up the X-Tree are resident on the processors. Each node is labeled with 
its level in the tree and the mesh coordinates of the processor it resides on. Tree levels start
at zero at the leaves, one at the next higher level and so on. A node is identified as:

$$N^{l}_{i_0, i_1, \ldots}$$

where $l$ is the level of the node in the tree and $i_0, i_1, \ldots$ are the mesh coordinates of the
processor on which the node resides ($P_{i_0, i_1, \ldots}$). This notation does not distinguish between
two or more same-level nodes on a single processor. This is not a problem because we
are not interested in any embeddings that put two or more same-level nodes on the same
processor: any embedding that puts more than one same-level node on a single processor
will tend not to distribute tree management costs as well as embeddings that put only one
node at each level on a single processor; experience shows that systems that distribute their
overheads poorly tend to perform poorly overall.


The Manhattan distance through the mesh between two nodes $N^{l}_{i_0, i_1, \ldots}$ and
$N^{m}_{j_0, j_1, \ldots}$ is written:

$$D\left(N^{l}_{i_0, i_1, \ldots}, N^{m}_{j_0, j_1, \ldots}\right) = |j_0 - i_0| + |j_1 - i_1| + \cdots$$

We assume that the cost of communicating between two nodes is equal to the Manhattan
distance between the nodes. This is true, for example, for the e-cube routing scheme [26].
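
The cost assumption can be checked on a small example. The sketch below uses two hypothetical helpers: `manhattan` evaluates $D$ from processor coordinates, and `dimension_order_path` lists the processors visited when a message corrects one coordinate at a time in a fixed order, which is how e-cube routing behaves on a mesh. Contention and per-hop latency are not modeled.

```python
def manhattan(a, b):
    """Manhattan distance between two processors given as coordinate tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dimension_order_path(src, dst):
    """Processors visited when routing one dimension at a time, in order."""
    cur = list(src)
    path = [tuple(cur)]
    for dim in range(len(cur)):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path
```

For example, a message from processor (0, 3) to processor (2, 1) visits (1, 3), (2, 3), (2, 2), and (2, 1), which is four hops, equal to the Manhattan distance between the endpoints.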


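We use different notation to refer to the distance covered when taking the shortest path
through the X-Tree between two nodes:

$$\mathcal{X}\left(N^{l}_{i_0, i_1, \ldots}, N^{m}_{j_0, j_1, \ldots}\right)$$

In other words, $D\left(N^{l}_{i_0, i_1, \ldots}, N^{m}_{j_0, j_1, \ldots}\right)$ is the shortest distance between $N^{l}_{i_0, i_1, \ldots}$ and
$N^{m}_{j_0, j_1, \ldots}$ through the mesh, while $\mathcal{X}\left(N^{l}_{i_0, i_1, \ldots}, N^{m}_{j_0, j_1, \ldots}\right)$ is the shortest distance between the
two nodes when traversing the X-Tree.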


The $m$th ancestor of node $N^{0}_{i_0, i_1, \ldots}$ is written:

$$A^{m}_{i_0, i_1, \ldots}$$

More generally, the $m$th ancestor of any node $N^{l}_{i_0, i_1, \ldots}$, which is the $(m+l)$th
ancestor of node $N^{0}_{i_0, i_1, \ldots}$, is written:

$$A^{l+m}_{i_0, i_1, \ldots}$$


We use this notation to discuss relationships between X-Tree nodes and their ancestors. In
particular, we need to discuss costs associated with climbing the tree from a node to one of
its ancestors.

The relationship between a node ($N^{l}_{i_0, i_1, \ldots}$) and its parent
($A^{l+1}_{i_0, i_1, \ldots} = N^{l+1}_{i'_0, i'_1, \ldots}$) is
expressed through a set of (usually simple) embedding functions:

$$i'_0 = f_0(i_0, i_1, \ldots); \quad i'_1 = f_1(i_0, i_1, \ldots); \quad \ldots$$

Clearly, the costs associated with traversing an X-Tree depend on how the tree is embedded
into the communications network. These embedding functions are used to formalize a
particular embedding scheme. Formal statements of embedding functions are used in formal
algorithm statements and in mathematical proofs.

Figure 4-3: Embedding a Binary Tree in a One-Dimensional Mesh: Naive Solution.


4.1.2 Embedding a Tree in a k-ary n-Dimensional Mesh 


An X-Tree embedding has to capture the locality in the communications network while
balancing the overhead its management costs place on the individual processors. The locality
captured by the X-Tree depends on its embedding in the underlying space: a “good”
embedding places nodes that are topologically near one another in the X-Tree physically near
one another in the network. The X-Tree should also be distributed so as to minimize the
maximum load placed on any individual processor by any part of the thread-management
algorithm.


One-Dimensional Case 


Figure 4-3 illustrates a straightforward embedding of a binary tree in a one-dimensional
mesh. Each processor’s leaf node is resident on that processor. The following expression
captures the relationship between a node and its ancestors:

$$A^{l}_{i} = N^{l}_{i'}; \quad i' = i \wedge (-2^{l})$$

Using this embedding, if a node on $P_i$ has an ancestor at level $l$, that ancestor is on $P_{i'}$,
where $i'$ is simply $i$ with $l$ low-order bits masked off. This embedding has two major
drawbacks. First, worst-case tree-traversal costs between leaf nodes are worse than they
need to be. Second, note that some processors (e.g., $P_0$) contain more nodes of the tree
than others. This leads to higher network contention near those heavily used nodes, cutting
down on overall performance. Also, the processors and memories on those nodes are more
heavily loaded down with thread management overhead, leading to even more imbalance
and worse performance.

Figure 4-4: Embedding a Binary Tree in a One-Dimensional Mesh: Better Solution.


Figure 4-4 illustrates a better embedding of a binary tree in a one-dimensional mesh:
$$A_i^l = N_{i'}^l; \quad i' = [i \wedge (-2^{l-1})] \vee 2^{l-1} \quad (\text{for } l \ge 1)$$


Using this embedding, if a node on $P_i$ has an ancestor at level $l$, that ancestor is on $P_{i'}$,
where $i'$ is $i$ with $l - 1$ low-order bits masked off and with the $l$th bit set to one. This gives
a better worst-case tree traversal cost than the first embedding. Furthermore, it guarantees
that at most two tree nodes are resident on any given processor, yielding better hot-spot
behavior.
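The two one-dimensional embeddings can be compared with a small sketch. The function names below (`naive_ancestor`, `better_ancestor`, `nodes_per_processor`) are illustrative assumptions rather than names taken from XTM, and the machine size `k` is assumed to be a power of two.

```python
def naive_ancestor(i: int, l: int) -> int:
    """Processor hosting the level-l ancestor of leaf i under the naive
    embedding: i with its l low-order bits masked off."""
    return i & -(1 << l)


def better_ancestor(i: int, l: int) -> int:
    """Processor hosting the level-l ancestor of leaf i under the better
    embedding: i with l-1 low-order bits masked off and the l-th bit set."""
    if l == 0:
        return i  # a leaf node lives on its own processor
    bit = 1 << (l - 1)
    return (i & -bit) | bit


def nodes_per_processor(ancestor, k: int) -> list:
    """Count the distinct tree nodes (leaves included) resident on each of
    k processors; k is assumed to be a power of two."""
    height = k.bit_length() - 1
    resident = {(l, ancestor(i, l)) for i in range(k) for l in range(height + 1)}
    counts = [0] * k
    for _, proc in resident:
        counts[proc] += 1
    return counts


if __name__ == "__main__":
    print(nodes_per_processor(naive_ancestor, 8))
    print(nodes_per_processor(better_ancestor, 8))
```

On eight processors the naive embedding places four nodes on processor 0, while the better embedding places at most two nodes on any processor.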

The distance between any two nodes in a one-dimensional mesh is:
$$D(N_i^l, N_j^l) = |j - i|.$$

For the suggested embedding, the distance between any node at level $m$ and its parent is:
$$D(N_i^m, A_i^{m+1}) = 2^{m-1}$$

except at the leaves of the tree.

The distance between a leaf node and its parent is:
$$D(N_i^0, A_i^1) = \text{either } 0 \text{ or } 1,$$
depending on whether the parent is on the same processor as the child or not.

In the one-dimensional case, all of a node’s near neighbors are the same distance from the
node:
$$D(N_i^m, N_{i \pm 2^m}^m) = 2^m.$$
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Figure 4-5: Embedding a Quad-Tree in a Two-Dimensional Mesh. 


Two-Dimensional Case 


Figure 4-5 illustrates a well-distributed embedding of a quad-tree in a two-dimensional
mesh:
$$A_{i_0,i_1}^l = N_{i_0',i_1'}^l; \quad i_0' = [i_0 \wedge (-2^{l-1})] \vee 2^{l-1}; \quad i_1' = [i_1 \wedge (-2^{l-1})] \vee 2^{l-1} \quad (\text{for } l \ge 1)$$


Using this embedding, if a node on $P_{i,j}$ has an ancestor at level $l$, that ancestor is on $P_{i',j'}$.
$i'$ is $i$ with $l - 1$ low-order bits masked off and with the $l$th bit set to one; $j'$ is $j$ with $l - 1$
low-order bits masked off and with the $l$th bit set to one.
For any two-dimensional embedding, the distance between any two nodes is:
$$D(N_{i_0,i_1}^l, N_{j_0,j_1}^l) = |j_0 - i_0| + |j_1 - i_1|.$$
For the suggested embedding, the distance between any node at level $m$ and its parent is:
$$D(N_{i_0,i_1}^m, A_{i_0,i_1}^{m+1}) = 2(2^{m-1}),$$
except at the leaves of the tree.
The distance between a leaf node and its parent is:
$$D(N_{i_0,i_1}^0, A_{i_0,i_1}^1) = \text{either } 0, 1 \text{ or } 2,$$
depending on whether the parent is on the same processor as the child or not.


In the two-dimensional case, a node has two classes of near neighbors: those that are
edge-adjacent and those that are vertex-adjacent. Each node has up to four edge-adjacent
neighbors and up to four vertex-adjacent neighbors. For example, in Figure 4-2, node $N_{0,1}^0$
has five near neighbors: $N_{0,0}^0$, $N_{1,0}^0$, $N_{1,1}^0$, $N_{1,2}^0$ and $N_{0,2}^0$; $N_{0,0}^0$, $N_{1,1}^0$ and $N_{0,2}^0$ are
edge-adjacent; $N_{1,0}^0$ and $N_{1,2}^0$ are vertex-adjacent.
The distance between any node at level $m$ and its edge-adjacent nearest neighbors is:
$$D(N_{i_0,i_1}^m, N_{i_0 \pm 2^m, i_1}^m) = D(N_{i_0,i_1}^m, N_{i_0, i_1 \pm 2^m}^m) = 2^m.$$
The distance between any node at level $m$ and its vertex-adjacent nearest neighbors is:
$$D(N_{i_0,i_1}^m, N_{i_0 \pm 2^m, i_1 \pm 2^m}^m) = 2(2^m).$$


n-dimensional Case 


It would be difficult to illustrate a tree embedded in an n-dimensional mesh. However, we
can generalize the results presented in earlier sections to obtain the following embedding
functions for $l \ge 1$:
$$A_{i_0,i_1,\ldots,i_{n-1}}^l = N_{i_0',i_1',\ldots,i_{n-1}'}^l$$
$$i_0' = [i_0 \wedge (-2^{l-1})] \vee 2^{l-1}$$
$$i_1' = [i_1 \wedge (-2^{l-1})] \vee 2^{l-1}$$
$$\vdots$$
$$i_{n-1}' = [i_{n-1} \wedge (-2^{l-1})] \vee 2^{l-1}$$

Using this embedding, if a node on $P_{i_0,i_1,\ldots,i_{n-1}}$ has an ancestor at level $l$, that ancestor is
on $P_{i_0',i_1',\ldots,i_{n-1}'}$. All the $i_a'$ are obtained by taking the corresponding $i_a$, masking off the
$l - 1$ low-order bits and setting the $l$th bit to one.
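The n-dimensional embedding applies the one-dimensional rule to each coordinate independently, which the following sketch makes concrete. The helper names `ancestor_coords` and `manhattan` are assumptions introduced for illustration, and mesh coordinates are represented as plain tuples.

```python
def ancestor_coords(coords, l):
    """Mesh coordinates of the processor hosting the level-l ancestor of the
    leaf at `coords`: each coordinate has its l-1 low-order bits masked off
    and its l-th bit set to one."""
    if l == 0:
        return tuple(coords)  # a leaf node lives on its own processor
    bit = 1 << (l - 1)
    return tuple((c & -bit) | bit for c in coords)


def manhattan(a, b):
    """Mesh distance between two processors: the sum of per-dimension offsets."""
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    leaf = (5, 2)
    for level in range(4):
        print(level, ancestor_coords(leaf, level))
```

Because every coordinate moves by the same amount between consecutive levels, the mesh distance between a non-leaf node and its parent grows linearly with the number of dimensions.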


For any n-dimensional embedding, the distance between any two nodes is:
$$D(N_{i_0,i_1,\ldots,i_{n-1}}^l, N_{j_0,j_1,\ldots,j_{n-1}}^l) = |j_0 - i_0| + |j_1 - i_1| + \cdots + |j_{n-1} - i_{n-1}|$$


For the given embedding, the distance between any node at level $m$ and its parent is:
$$D(N_{i_0,i_1,\ldots,i_{n-1}}^m, A_{i_0,i_1,\ldots,i_{n-1}}^{m+1}) = n(2^{m-1})$$
except at the leaves of the tree.

The distance between a leaf node and its parent is:
$$D(N_{i_0,i_1,\ldots,i_{n-1}}^0, A_{i_0,i_1,\ldots,i_{n-1}}^1) = \text{either } 0, 1, \ldots, \text{ or } n.$$


In the n-dimensional case, a node has $n$ classes of near neighbors, each of which is a different
distance from the node. We label these classes with the index $i$, where $i$ ranges from 1 to
$n$. The maximum number of neighbors in class $i$ is:
$$\binom{n}{i} 2^i.$$
Note that this yields a maximum total of $3^n - 1$ neighbors.

The distance between any node at level $m$ and a neighbor of class $i$ is:
$$i(2^m).$$
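The total of $3^n - 1$ neighbors follows from the binomial theorem. The short derivation below is a sketch that assumes a class-$i$ neighbor differs from the node in exactly $i$ of the $n$ dimensions, with two possible directions in each of them:

```latex
\sum_{i=1}^{n} \binom{n}{i} 2^{i}
  = \sum_{i=0}^{n} \binom{n}{i} 2^{i}\, 1^{\,n-i} - 1
  = (2 + 1)^{n} - 1
  = 3^{n} - 1
```

For $n = 2$ this gives four edge-adjacent and four vertex-adjacent neighbors, eight in total.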


4.2 Updating Information in the X-Tree 


Every processor has its own runnable thread queue. When a new thread is created, it is 
added to the creating processor’s queue. Similarly, when a thread is enabled (unblocked), it 
is added to the queue of the processor on which it most recently ran. Threads can be moved 
from one processor’s queue to another by a thread search initiated by an idle processor (see 
below). Before a thread is run, it is removed from its queue. 

When a thread is added to or taken from a queue, this fact is made public by means of
presence bits in the X-Tree. When a node’s presence bit is set, that means that there are
one or more runnable threads to be found somewhere in the sub-tree rooted at that node;
a cleared presence bit indicates a sub-tree with no available work. Of course, since the
X-Tree is a distributed data structure, updating this presence information cannot occur
instantaneously; as long as an update process is in progress, the presence information in
the tree is not absolutely accurate. Therefore, the thread distribution algorithm can only
use these presence bits as hints to guide a search; it must not depend on their absolute
accuracy at all times.


Presence information is disseminated as follows: when a runnable thread queue goes 
either from empty to non-empty or from non-empty to empty, an update process is initiated. 
This process recursively climbs the tree, setting or clearing presence bits on its way up. A 
presence-bit-setting update process terminates when it reaches a node whose presence bit is 
set; likewise, a presence-bit-clearing update process terminates when it reaches a node that 
has some other child whose presence bit is set. Appendix A gives a more formal pseudocode 


description of the presence bit update algorithm. 
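The termination rules for the two kinds of update process can be sketched as follows. This is a sequential, single-address-space approximation for a binary X-Tree on a one-dimensional mesh, not the pseudocode of Appendix A; the class name `PresenceTree` and its methods are assumptions made for illustration, and nodes are indexed by (level, block) rather than by hosting processor.

```python
class PresenceTree:
    """Presence bits for a binary X-Tree over k leaves (k a power of two).
    bits[l][b] belongs to the level-l node covering leaves
    b * 2**l through (b + 1) * 2**l - 1."""

    def __init__(self, k):
        self.height = k.bit_length() - 1
        self.bits = [[False] * (k >> l) for l in range(self.height + 1)]

    def queue_became_nonempty(self, leaf):
        """Climb the tree setting bits; stop at the first node whose bit is
        already set. Returns the number of nodes modified."""
        modified = 0
        for l in range(self.height + 1):
            b = leaf >> l
            if self.bits[l][b]:
                break
            self.bits[l][b] = True
            modified += 1
        return modified

    def queue_became_empty(self, leaf):
        """Climb the tree clearing bits; stop at the first node that still
        has a child whose bit is set. Returns the number of nodes modified."""
        modified = 0
        for l in range(self.height + 1):
            b = leaf >> l
            if l > 0 and (self.bits[l - 1][2 * b] or self.bits[l - 1][2 * b + 1]):
                break
            self.bits[l][b] = False
            modified += 1
        return modified
```

In this sketch the first queue to become non-empty pays for a climb to the root, while a later update from a sibling leaf stops after a single node.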


Update processes have the responsibility of making it globally known that work is avail- 
able. These efforts are combined at each node of the tree. The world only needs to be 
informed when a queue goes either from empty to non-empty or from non-empty to empty. 
When that happens, presence bits in the tree are set or cleared by an update process that 
recursively climbs the tree, terminating when it reaches a node whose presence bit state 
is consistent with its children’s states. When more than one update process arrives at a 
given node in the tree at one time, the processes are executed atomically with respect to 
each other. Assuming that no new information is received between the execution of these 
processes, only the first process ever proceeds on up the tree. In this way, the information 


distribution algorithm combines its efforts at the nodes of the tree. 


Whenever a presence bit in a node changes state, the node’s neighbors and parent are 
all informed. The presence-bit status of each of a node’s neighbors and children is cached 
locally at the node, decreasing search time significantly. Of course, this increases the cost 
of distributing information throughout the tree, but as Chapter 5 shows, the expected cost 
of information distribution is very low due to the decreasing likelihood of having to update 


nodes higher up in the tree. 
Figure 4-6: Two-Dimensional Presence Bit Cache Update: When the state of a node’s presence
bit changes, it informs its nearest neighbors, which cache presence bit information in order to cut
down on usage of the communications network. This information is disseminated using a simple
n-level multicast tree algorithm. Thick arrows indicate the first phase of the information dissemination
process (the updating node informs its two neighbors in the first dimension); thin arrows indicate
the second phase (the updating node and those two neighbors each inform their two neighbors in
the second dimension). In an n-dimensional system, there are n such phases.


Cached presence bits are updated in a divide-and-conquer fashion. The updating node
first informs its two nearest near-neighbors in the first dimension. Then all three nodes
inform their nearest near-neighbors in the second dimension. This continues until the $n$th
dimension is complete. Figure 4-6 illustrates the cached presence bit update algorithm.
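The phase structure of the cache update can be sketched as a message schedule. The function name `multicast_phases` and its parameters are assumptions for illustration; `spacing` stands for the distance between near neighbors at the updating node's level, and mesh boundaries are ignored for simplicity.

```python
def multicast_phases(origin, spacing, n):
    """Return one list of (sender, receiver) pairs per phase. In phase `dim`,
    every node informed so far sends to its two near neighbors along that
    dimension."""
    informed = [tuple(origin)]
    phases = []
    for dim in range(n):
        sends = []
        for node in informed:
            for step in (-spacing, spacing):
                target = list(node)
                target[dim] += step
                sends.append((node, tuple(target)))
        informed.extend(receiver for _, receiver in sends)
        phases.append(sends)
    return phases


if __name__ == "__main__":
    for number, sends in enumerate(multicast_phases((4, 4), 1, 2), start=1):
        print("phase", number, sends)
```

The number of senders triples in each phase, so after $n$ phases every one of the $3^n - 1$ near neighbors has received exactly one message.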


4.3. Thread Search 


An idle processor initiates a thread search process, which traverses the tree looking for 
runnable threads to execute. It starts at the leaf node local to the idle processor. 

When examining a given node, a searcher first checks the state of the node’s presence
bit. If it is set, then there is work somewhere in the sub-tree rooted at the node. If not, the
presence bits of the node’s neighbors are examined. If none of them is set, the searcher
starts again one level higher in the tree. If more than one searcher arrives at the same node,
then only one representative continues the search, and the rest wait for the representative
to find work for them all.
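The level-by-level search order can be sketched for the one-dimensional case. The functions `build_presence` and `find_work` are illustrative assumptions: they model accurate presence bits in a single address space, index nodes by (level, block) instead of by hosting processor, and omit the combining of multiple searchers.

```python
def build_presence(queue_lengths):
    """Presence bits per level for a binary X-Tree; the number of queues is
    assumed to be a power of two. bits[0] holds the leaf bits."""
    bits = [[q > 0 for q in queue_lengths]]
    while len(bits[-1]) > 1:
        prev = bits[-1]
        bits.append([prev[2 * b] or prev[2 * b + 1] for b in range(len(prev) // 2)])
    return bits


def find_work(bits, leaf):
    """Return (level, block) of the first node with a set presence bit found
    by a search starting at `leaf`, or None if there is no work anywhere."""
    for level, row in enumerate(bits):
        block = leaf >> level
        # Check the node itself first, then its near neighbors at this level.
        for candidate in (block, block - 1, block + 1):
            if 0 <= candidate < len(row) and row[candidate]:
                return level, candidate
    return None


if __name__ == "__main__":
    bits = build_presence([6, 3, 0, 0, 0, 0, 0, 0])
    print(find_work(bits, 5))
```

With work only on the first two of eight processors, a search from processor 5 finds nothing at levels 0 and 1 and first sees a set bit on its neighbor at level 2.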

When a searcher encounters a node whose presence bit is set, the searcher goes into 
gathering mode. A gatherer collects work from all of the leaves under the node where the 
gatherer originated. It starts by requesting work from those of the node’s children whose 
presence bits are set. The children request work from their children, and so on down to the 
leaves. Each leaf node sends half of its runnable threads back to its parent, which combines 
the work thus obtained from all of its children, sending it on up to its parent, and so on, 
back up to the node where the gathering process originated. 
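The gathering step can be sketched sequentially as follows. The function name `gather` is an assumption, as is the choice to round down when a queue holds an odd number of threads; the real process runs in parallel and only visits children whose presence bits are set, which this sketch approximates by skipping empty queues.

```python
def gather(queues, level, block):
    """Collect half of the runnable threads from every leaf under the node
    at (level, block). `queues` is a list of per-processor thread lists and
    is modified in place."""
    first = block << level
    last = (block + 1) << level
    gathered = []
    for leaf in range(first, last):
        take = len(queues[leaf]) // 2  # rounding down is an assumption
        if take:
            gathered.extend(queues[leaf][-take:])
            del queues[leaf][-take:]
    return gathered


if __name__ == "__main__":
    queues = [list("abcdef"), list("xyz"), [], []]
    print(gather(queues, 2, 0), queues)
```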

Finally, the set of threads thus obtained is distributed among the waiting searchers. As 
a representative searcher makes its way back down the tree to the processor that spawned 
it, at each level, it hands an equal share of the work off to other searchers awaiting its 


return. 


4.4 An Example 


Figures 4-7 through 4-15 illustrate the thread search process on an 8-ary 1-dimensional
mesh. In all of these figures, X-Tree nodes are represented by circles labeled $N_i^l$. Numbers
inside the circles indicate the state of the presence bits associated with each node. X-Tree
nodes are joined by thin lines representing near-neighbor links and thick lines indicating
parent-child links.

At each leaf node in the X-Tree, there is a square processor box labeled $P_i$ indicating
that the leaf node is associated with processor $i$. A shaded square indicates that the
processor has useful work to do; a non-shaded square indicates an idle processor. Under
each processor square is a small rectangle representing the runnable thread queue associated
with the processor. A queue rectangle with a line through it is empty; a non-empty queue
points to a list of squares indicating runnable threads.

Throughout this example, we disregard the fact that presence bits are cached by parents 
and neighbors. This has no effect on the functioning of the search algorithm; it is simply 


an optimization that improves search costs. 
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Figure 4-7: Thread Search Example — I. 


We start with processors $P_0$, $P_1$, $P_2$, $P_3$, $P_4$, $P_6$ and $P_7$ busy and processor $P_5$ idle.
All runnable thread queues are empty except for those on $P_0$ and $P_1$, which contain six
and three threads, respectively.
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Figure 4-8: Thread Search Example — II. 


$P_5$, which is idle, initiates a thread search. The search process first examines $P_5$’s leaf
node ($N_5^0$), where it finds no work. It then examines $N_5^0$’s two nearest neighbors, $N_4^0$ and
$N_6^0$, finding both of their presence bits to be zero, so it continues one level higher in the
tree with node $N_5^1$. $N_5^1$’s presence bit is zero, as are those of both its neighbors, $N_3^1$ and
$N_7^1$, so the search continues one level higher at node $N_6^2$. $N_6^2$’s presence bit is zero, but its
neighbor, $N_2^2$, has its presence bit set, so the search goes into gathering mode.
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Figure 4-9: Thread Search Example — III. 


The search process that originated on $P_5$ initiates a gathering process starting at node
$N_2^2$. A request for work is sent to node $N_1^1$, the one child of $N_2^2$ whose presence bit is set.
Requests for work are then sent to nodes $N_0^0$ and $N_1^0$, both children of $N_1^1$ and both of
which have set presence bits.

Meanwhile, $P_6$ has become idle, presumably because the thread it was executing either
blocked or terminated. It initiates a second thread search process, which examines the
presence bits of the following nodes: $N_6^0$, $N_5^0$ and $N_7^0$, $N_7^1$ and $N_5^1$, finding them all to be
zero. When it gets to node $N_6^2$, it finds that the other search process is searching outside
the immediate neighborhood, so it waits for the other searcher to return with work. The
waiting searcher is indicated in the figure by the symbol $S_6$.
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Figure 4-10: Thread Search Example — IV. 


The gathering process initiated at node $N_2^2$ has reached the leaves of the tree and now
goes into gather mode. Requests for work that went to nodes $N_0^0$ and $N_1^0$ cause half of
the work on $P_0$’s and $P_1$’s queues to be detached and sent back up the tree towards the
requester.

Meanwhile, yet another processor ($P_7$) has become idle. It initiates a third thread search
process, which examines the presence bits of nodes $N_7^0$ and $N_6^0$, which are both still zero.
When it gets to node $N_7^1$, it waits for the second searcher to return with work. The waiting
searcher is indicated in the figure by the symbol $S_7$.
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Figure 4-11: Thread Search Example — V. 


The gather process continues. The threads taken from processors $P_0$ and $P_1$ are
combined at node $N_1^1$ and sent on up to node $N_2^2$, where the gathering process started.
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Figure 4-12: Thread Search Example — VI. 


The threads are sent back to node $N_6^2$, where they are split between searcher $S_5$, which
initiated the gathering process, and searcher $S_6$, which waited back at node $N_6^2$ for $S_5$’s
return.
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Figure 4-13: Thread Search Example — VII. 


Searcher $S_5$ sends its threads one link closer to $P_5$. Meanwhile, the work brought back
by searcher $S_6$ is split up between $S_6$ and $S_7$, which was waiting at node $N_7^1$ for $S_6$’s return.
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Figure 4-14: Thread Search Example — VIII. 


All searchers have returned to their originating processors. At each processor, the first 
thread brought back is run and the rest are put on the runnable thread queue. Note that 
the presence bits in the tree have not yet been updated to reflect the new work that has 


just become available. 
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Figure 4-15: Thread Search Example — IX. 


A tree update process is sent out from $P_5$ because its runnable thread queue, which was
previously empty, now contains threads that may be run by other processors. The process
updates nodes $N_5^0$, $N_5^1$ and $N_6^2$. $N_4^3$’s presence bit was already set and therefore did not
have to be modified.


This example illustrated the thread search algorithm in action. Note that a single search 


process brought back work for three searching processors from the same neighborhood. 


In this chapter, we gave a detailed description of the two algorithms at the heart of 
XTM: the global information update algorithm and the thread search algorithm. In the 


next chapter, we give asymptotic analyses for the two algorithms. 
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Chapter 5 


Analysis 


In this chapter, we analyze the various pieces of XTM in terms of both execution time and 
network bandwidth consumed. In particular, we look at the thread search and presence- 
bit-update algorithms, analyzing the one-dimensional case, the two-dimensional case and 
the n-dimensional case for each. 


Our most important results are the following: 


1. On a machine with a sufficiently high, balanced workload, the expected cost of main- 
taining presence bits in the X-Tree is proved to be asymptotically constant, regardless 


of machine size. 


2. The algorithm that matches runnable threads with idle processors is shown to be eight- 
competitive with an idealized drafting-style adversary, running on a two-dimensional 


mesh network. 


3. The message-passing communication style is shown to yield fundamental improvements
in efficiency over a shared-memory style. For the matching process, the advantage
is a factor of $\log l$, where $l$ is the distance between an idle processor and the
nearest runnable thread.


In addition, we give asymptotic cost bounds for XTM’s search and update algorithms on
one-, two- and n-dimensional mesh networks. We give results in terms of maximum latency
and bandwidth requirements.
Unless otherwise stated, we assume that an efficient message-passing mechanism is
employed, i.e., a message can be sent from any processor in the machine to any other with no
acknowledgment necessary [15]. When such a message represents an active entity moving
about the system, then decisions about where it is to be sent next can be made without
having to communicate with the processor from which the message first originated.

In addition, all analyses of search costs in this chapter assume that information in the 
tree is accurate. This is in fact an approximation: due to delays in the communications 
network, in a dynamic environment, it is impossible to guarantee that dynamic data struc- 
tures are consistent at all times. The logic of the presence bit update algorithm guarantees 
that soon after the global state stops changing, the information in the tree will be accu- 
rate and consistent. It would be interesting to speculate as to the effect of the temporary 
inaccuracies of the information in the tree. 

Finally, all analyses assume that local computation is free; all costs are measured in 
terms of communication latency. To obtain execution time, we look at the (serialized) 
communication requirements of the algorithm’s critical path, assuming that non-critical- 
path pieces of the algorithm do not interfere with the critical path. We also assume that 
all messages are the same size, therefore consuming the same network bandwidth per unit 
distance. This is again an approximation: in fact, the more work that is moved, the more 
network bandwidth is consumed. However, for the Alewife machine, most of the bandwidth 
consumed by a message is start-up cost: it takes far longer to set up a path than it does 
to send flits down a path that has already been set up. For this reason, we stay with the 


constant-message-size approximation. 
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5.1 One-Dimensional Case 


The one-dimensional case is easiest to illustrate. This section presents in-depth analyses of 
the various pieces of the thread-distribution algorithm as implemented on a one-dimensional 
mesh. In Section 5.2, we extend the analyses to two dimensions; in Section 5.3, we perform 


the same analyses for an n-dimensional mesh. 


Throughout this section, the minimum distance $d$ between two nodes on processors $P_i$ and
$P_j$ is simply $|j - i|$:
$$d = D(N_i^0, N_j^0) = |j - i|$$

This is the lowest possible cost for communicating between $P_i$ and $P_j$.

Also, when following the shortest path through the tree between nodes $N_i^0$ and $N_j^0$, we have
to ascend $l$ levels. $l$, which is an integer, is either $\lfloor \log_2 d \rfloor$ or $\lceil \log_2 d \rceil$, depending on how the
two nodes are aligned with respect to the tree.
$$\lfloor \log_2 d \rfloor \le l \le \lceil \log_2 d \rceil$$

Finally, $L$ is the height of the tree:
$$\lfloor \log_2 k \rfloor \le L \le \lceil \log_2 k \rceil$$


5.1.1 Search 


The search algorithm can be executed using either the message-passing style or the
shared-memory style. In this section, we derive lower bounds on the latency using both styles.
We show that for the message-passing style, the X-Tree-guided search algorithm is
four-competitive with the optimal. For the same algorithm, the shared-memory style yields
results that are worse by a factor of $\log d$, where $d$ is the distance between a processor and
the nearest available work.


Message-Passing


Here, we show that the cost of searching the X-Tree for nearby work is four-competitive
with the optimal adversary. In a k-ary one-dimensional mesh, a path from a node $N_i^0$ to
node $N_j^0$ is established by ascending $l$ levels in the tree from node $N_i^0$ to its level-$l$ ancestor
$A_i^l$, crossing over to $A_j^l$ and descending back down to node $N_j^0$. The following analysis
demonstrates the competitive factor of four.

The distance covered when taking the shortest path through the tree between nodes is the
following:
$$X(N_i^0, N_j^0) = D(N_i^0, A_i^l) + D(A_i^l, A_j^l) + D(A_j^l, N_j^0)$$
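One way to arrive at the factor of four is sketched below. It is a reconstruction under stated assumptions rather than the thesis's own derivation: a leaf is at most one hop from its parent, a level-$a$ node ($a \ge 1$) is $2^{a-1}$ hops from its parent, level-$l$ neighbors are $2^l$ hops apart, and $1 \le l \le \lceil \log_2 d \rceil < \log_2 d + 1$.

```latex
\begin{aligned}
X(N_i^0, N_j^0)
  &= D(N_i^0, A_i^l) + D(A_i^l, A_j^l) + D(A_j^l, N_j^0) \\
  &\le \Bigl(1 + \sum_{a=1}^{l-1} 2^{a-1}\Bigr) + 2^{l}
     + \Bigl(1 + \sum_{a=1}^{l-1} 2^{a-1}\Bigr) \\
  &= 2^{l-1} + 2^{l} + 2^{l-1} = 2^{l+1} \\
  &< 2^{\log_2 d + 2} = 4d
\end{aligned}
```

When $l = 0$ the two leaves are near neighbors and the tree path coincides with the direct path.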
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Therefore, the traversal cost for the X-Tree is no more than a factor of four worse than 
direct access through the mesh. 

So far, we have given an upper bound on the tree traversal cost between two leaf nodes. 
The actual search process is a bit more complicated for two reasons. First, once a runnable 
thread is found, it has to travel back to the requesting processor. Second, since the
X-Tree is used for combining as well as search-guiding purposes, it employs a gathering 
process once work is located, so that half of the work on the subtree with work is sent 
back to the requesting sub-tree, to be distributed among all requesting processors from that 
sub-tree. The gathering process first broadcasts requests to the entire sub-tree rooted at the 
node and then waits for replies, which are combined at the intermediate nodes between the 
source of the gathering process and the leaves of the sub-tree rooted at that node. Since the 
gathering process executes in a divide-and-conquer fashion, the latency for the gathering 
process is the same as the time it takes for a single message to be sent to a leaf from the 
originating node, and for a reply to be sent back. Therefore, the critical path for the search 
algorithm costs at most 8d, which is twice as much as the worst-case traversal calculated 
above. 


The optimal adversary simply sends a request to the nearest processor containing work. 
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Figure 5-1: | One-Dimensional Search — Worst Case: A direct path from node A (N?) to node 
B (N?,) covers five network hops. The corresponding tree traversal covers 15 network hops: N2 — 
Ni > N2 > N}? > N}?, 7 N2, > Ni, — NP. 


The destination processor then sends some of its work back to the requester. If the
requesting processor is $P_i$ and the destination processor is $P_j$, then this entire process has a cost
of $2d$. This gives us a competitive factor of four when comparing the X-Tree-based search
to the optimal adversary.

Figure 5-1 gives an example of a worst-case tree traversal scenario on a 16-processor 
one-dimensional mesh: the tree traversal is three times more expensive than the direct 


path. Similar examples on larger meshes approach the worst-case factor of four. 
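The hop counts for such a scenario can be checked with a short sketch. It assumes the better one-dimensional embedding and assumes that the search ascends to the lowest level at which the two leaves' ancestors are near neighbors; the function names and the endpoint pair (processors 3 and 8) are illustrative choices, not values read from Figure 5-1.

```python
def host(i, l):
    """Processor hosting the level-l ancestor of leaf i (better embedding)."""
    if l == 0:
        return i
    bit = 1 << (l - 1)
    return (i & -bit) | bit


def meeting_level(i, j):
    """Lowest level at which the ancestors of leaves i and j coincide or are
    near neighbors."""
    l = 0
    while abs((i >> l) - (j >> l)) > 1:
        l += 1
    return l


def tree_hops(i, j):
    """Network hops covered when travelling from leaf i up to the meeting
    level, across to the neighbor, and down to leaf j."""
    l = meeting_level(i, j)
    up = sum(abs(host(i, a + 1) - host(i, a)) for a in range(l))
    across = abs(host(i, l) - host(j, l))
    down = sum(abs(host(j, a + 1) - host(j, a)) for a in range(l))
    return up + across + down


if __name__ == "__main__":
    print(abs(8 - 3), tree_hops(3, 8))
```

For processors 3 and 8 on a 16-processor mesh the direct path is five hops and the tree path is 15 hops, a ratio of three.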
We now derive the worst-case network bandwidth requirement for a search that begins
at $N_i^0$ when the closest work is at $N_j^0$. The bandwidth consumed is broken into four pieces:
ascending the tree to the lowest level that has a neighbor that has work, sending a “gather”
message to that neighbor, gathering half the available work from that neighbor, and sending
that work back to the original source of the request.

$$\begin{aligned}
 &< (l + 6)(2^{l-1}) \\
 &< (7 + \log_2 d)(2^{\log_2 d + 1 - 1}) \\
 &< (7 + \log_2 d)\,d
\end{aligned}$$


The network bandwidth consumed by the search process is $O(d \log d)$, as compared to a
running time of $O(d)$. This is the expected behavior, due to the use of a gathering process
to collect threads from the entire sub-tree rooted at $A_j^l$.
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Shared-Memory 


When shared-memory processing is used, the matching algorithm becomes more expensive.
The path from a node $N_i^0$ to node $N_j^0$ is established by ascending $l$ levels in the tree from
node $N_i^0$ to its level-$l$ ancestor $A_i^l$, crossing over to $A_j^l$ and descending back down to node $N_j^0$.
Each step in the algorithm requires communication between the original source processor
$P_i$ and a different node in the X-Tree.

$$X(N_i^0, N_j^0) = \sum_{a=1}^{l} 2D(N_i^0, A_i^a) + \sum_{a=0}^{l} 2D(N_i^0, A_j^a)$$
In other words, for the shared-memory case, the cost is $\Omega(d \log d)$. This is more expensive
than the idealized adversary by a factor of $\log d$. This cost derives from repeated
long-distance accesses through the communications network as the search closes in on its quarry.
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5.1.2 Presence Bit Update 


In this section, we derive lower bounds on the latency using both shared-memory and 
message-passing communication styles. We show that for the message-passing style, the X- 
Tree-guided update algorithm has a worst-case cost proportional to the network diameter, 
but an expected cost independent of machine size, given a machine load that is both balanced 
and sufficiently high. The asymptotic behavior of the update algorithm is not nearly as 
sensitive to the communication style as the search algorithm. We show that the worst case 


cost only goes up by a factor of two when the shared-memory communication style is used. 


Message-Passing


Whenever a runnable thread queue changes state between empty and non-empty, that infor- 
mation is recorded in the form of presence bits in the tree. Three sub-tasks are performed 


at each level in the tree: 


1. Determine whether the presence bit at this node needs to be modified. If so, set this 


node’s presence bit to the new value and continue with step two. If not, exit. 


2. Direct this node’s neighbors to change their cached presence bit copies for this node 


to the new value. 


3. Inform this node’s parent — continue the update process by starting this algorithm at 


step one on the parent node. 


Step 1 incurs no cost: it only requires local calculations. Step 2 is not included in the 
critical path analysis because it is not in the algorithm’s critical path and can execute 
concurrently. Therefore, it is only the child-parent communication cost that affects the 


algorithm’s critical-path cost. 
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The update cost can be characterized in terms of worst case and expected behavior. 
We show that although the worst-case cost is proportional to the network diameter, the 
expected cost is independent of machine size for certain machine load conditions. First, the 
bad news: U, the worst-case tree update critical-path cost, can be as bad as O (k). We now 


show why this is the case. 
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Note that the worst-case tree update cost is proportional to the network diameter. We will 


find that this is also true in the general (n-dimensional) case. 
Now, the good news: the expected tree update cost is much lower. We calculate the
expected cost for a sufficiently loaded, balanced machine. By balanced, we mean that every
thread queue has the same probability distribution for its queue length. By sufficiently
loaded, we mean that there exists some non-zero constant lower bound $\lambda$ on the probability
that the queue is non-empty. In other words, every processor’s thread queue has a
distribution of queue lengths such that the probability of non-zero queue length is at least
$\lambda$, for some $\lambda$: $\mathrm{Prob}(qlen > 0) \ge \lambda$. In the following derivation, $E[U]$ signifies the expected
update cost, $\mathrm{Prob}(lev = i)$ represents the probability that the algorithm makes it to exactly
the $i$th level and $C(i)$ is the expected cost of communicating between a node at the $i$th level
and its parent. Finally, let $\mu = (1 - \lambda)$.
Most of this derivation is straightforward algebraic manipulation. The only subtlety
concerns the transition from $\sum_{i} \mu^{2^i}(2^i)$ to $\sum_{j=1}^{\infty} j\mu^j$. The second of the two expressions is
simply a restatement of the first expression, adding some (strictly positive) terms to
the sum and changing the summation index.
This result shows that although the worst-case tree update cost is proportional to the
network diameter, on a machine whose workload is both balanced and sufficiently high, the
expected update cost is $O\left(\frac{1}{\lambda^2}\right)$, which does not depend on the network diameter.
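The shape of the argument can be sketched as follows. This is a reconstruction under stated assumptions, not the thesis's own derivation: queue states are taken to be independent, presence bits are taken to be accurate, an update leaves a level-$i$ node for its parent only if the other $2^i - 1$ queues beneath that node are empty, the cost of that hop is at most $2^i$, and $0 < \mu < 1$.

```latex
\begin{aligned}
E[U] &\le \sum_{i=0}^{L-1} \mu^{\,2^i - 1}\,(2^i)
      = \frac{1}{\mu} \sum_{i=0}^{L-1} \mu^{\,2^i}\,(2^i) \\
     &\le \frac{1}{\mu} \sum_{j=1}^{\infty} j\,\mu^{j}
      = \frac{1}{\mu} \cdot \frac{\mu}{(1-\mu)^2}
      = \frac{1}{\lambda^2}
\end{aligned}
```

The second inequality adds the terms for every $j$ that is not a power of two, and the closed form uses the standard identity $\sum_{j \ge 1} j\mu^j = \mu / (1-\mu)^2$.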


The network bandwidth requirement for the tree update process is somewhat higher; not 
only does it include child-parent communication costs, but it also includes costs incurred 


when updating neighbors’ cached presence bits. 
$$\begin{aligned}
B(U) &\le k + \sum_{i=0}^{L-1} 2(2^i) \\
     &\le k + 2\left(2^{\log_2 k + 1}\right) \\
     &\le k + 4k
\end{aligned}$$


Like the critical-path cost, the worst-case bandwidth requirement for the tree update process 
is O(k), differing from the critical-path cost by a constant factor only. We will find that 
this constant factor is a function of n, the network dimensionality. 

The expected bandwidth requirement for the tree update process is also significantly 


lower than the worst case: 


E[B(U)]| = E[U] + E[B (neighbor-cache update)| 
< SA fea] + Eee) a 
< CAs] 


On a machine whose workload is both balanced and sufficiently high, the expected
bandwidth requirement is $O\left(\frac{1}{\lambda^2}\right)$, which is independent of $k$, the network diameter. Note that
the expected update bandwidth differs from the critical-path bandwidth only by a constant
factor. In this case as well, we will see that this factor is a function of $n$.
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Shared-Memory 


For the matching process, the cost became fundamentally higher when going to the shared- 
memory programming style. For the tree update process, this is not the case; the rise in 


cost is a constant factor. 
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In the case of message-passing, the worst-case tree update cost was O(k). Here in the 


shared-memory case, the worst-case update cost is also O (k). 
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5.2 Two-Dimensional Case 


This section presents in-depth analyses of the various pieces of the thread-distribution al- 
gorithm as implemented on a two-dimensional mesh. Two-dimensional mesh networks pro- 
vide more communication capacity than one-dimensional networks. Also, both one- and 
two-dimensional networks can scale to arbitrarily large sizes without encountering any the- 
oretical snags related to wire lengths, wire densities or heat dissipation. Networks of three 
or more dimensions have problems existing in real space, due to wire packing, wire lengths 


and heat dissipation issues. 
Throughout this section, the minimum distance d between two nodes on processors Pj, i, 
and Pj,;, is simply |jo — 20] + |g1 — 24: 
d=D (M2, Nis,) = lio — tol + li — 41 
This is the lowest possible cost for communicating between P;,;, and P¥j,,;,. 


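Also, when following the shortest path through the tree between nodes $N^0_{i_0,i_1}$ and
$N^0_{j_0,j_1}$, we have to ascend $l$ levels:
\[ \max\left(\lfloor \log_2 |j_0 - i_0| \rfloor, \lfloor \log_2 |j_1 - i_1| \rfloor\right) \le l \le \max\left(\lceil \log_2 |j_0 - i_0| \rceil, \lceil \log_2 |j_1 - i_1| \rceil\right) \]
\[ \left\lfloor \max\left(\log_2 |j_0 - i_0|, \log_2 |j_1 - i_1|\right) \right\rfloor \le l \le \left\lceil \max\left(\log_2 |j_0 - i_0|, \log_2 |j_1 - i_1|\right) \right\rceil \]
\[ \lfloor \log_2 (d/2) \rfloor \le l \le \lceil \log_2 d \rceil \]
The exact value for $l$ depends on how the nodes are aligned with respect to the tree.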


Finally, $L$ is the height of the tree:
\[ \lfloor \log_2 k \rfloor \le L \le \lceil \log_2 k \rceil \]

5.2.1 Search

In this section, we derive lower bounds on the latency and bandwidth costs using the
message-passing style. We also show that for the message-passing style, the X-Tree-guided
search algorithm is eight-competitive with the optimal. For the same algorithm, the
shared-memory style yields results that are worse by a factor of log d, where d is the distance
between a processor and the nearest available work. Proof of this fact follows the derivation
for the one-dimensional network.

Message-Passing


In this section, we show that the cost of searching the X-Tree for nearby work is
eight-competitive with the optimal adversary. In a k-ary two-dimensional mesh, a path from a
node $N^0_{i_0,i_1}$ to node $N^0_{j_0,j_1}$ is established by ascending $l$ levels in the tree
from node $N^0_{i_0,i_1}$ to node $N^l_{i_0,i_1}$, crossing over to node $N^l_{j_0,j_1}$ and
descending back down to node $N^0_{j_0,j_1}$. The following analysis demonstrates the
competitive factor of eight:

Therefore, the traversal cost for the X-Tree is no more than a factor of eight worse than
direct access through the mesh.

As is shown for the one-dimensional case, both for the X-Tree algorithm and for the
optimal adversary, a search costs twice as much as a simple message send from source to
destination. Therefore, the competitive factor of eight holds for a search as well as for a
one-way message send.

Figure 5-2: Two-Dimensional Search, Worst Case: A direct path from node A (NP, 15) to
node B (NPs, 4) covers ten network hops. The corresponding tree traversal covers 62 network hops: 
4 4 : 

N¥o,15 = Niv1s = Nes.14 > N3o,12 > Noag > Nga > N?o 98 = N?4,96 = Nis,28 = Nos, 24: 


Figure 5-2 gives an example of a worst-case tree traversal scenario on a 32 by 32 mesh:
the tree traversal is more than six times more expensive than the direct path. Similar
examples on larger meshes approach the worst-case factor of eight.
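
As a quick arithmetic check of the figures quoted for Figure 5-2, the following sketch counts network hops for a traversal that ascends some number of levels, crosses over, and descends again. The per-level costs used here (at most 1 + sum of n * 2^(x-1) for x from 1 to l-1 hops to ascend or descend, and n * 2^l hops to cross) are an assumption, chosen because they reproduce the 62-hop traversal in the caption; the function name is purely illustrative.

```python
def tree_traversal_hops(levels, dims=2):
    """Hop count for an up-cross-down X-Tree traversal.

    Assumed cost model (not quoted verbatim from the text):
      ascend or descend: 1 + sum_{x=1}^{levels-1} dims * 2**(x-1)
      cross over:        dims * 2**levels
    """
    climb = 1 + sum(dims * 2 ** (x - 1) for x in range(1, levels))
    cross = dims * 2 ** levels
    return climb + cross + climb


# Figure 5-2: two-dimensional mesh, four levels ascended, ten-hop direct path.
hops = tree_traversal_hops(levels=4, dims=2)
ratio = hops / 10
print(hops, ratio)
```

Under these assumptions the ratio for the Figure 5-2 scenario stays below the worst-case factor of eight, in line with the statement that larger meshes only approach that factor.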

We now derive the worst-case network bandwidth requirement for a search that begins
at $N^0_{i_0,i_1}$ when the closest work is at $N^0_{j_0,j_1}$:


B (Nei M25) < ( at: > 2 Ca) ch) (2') ai 

Although the latency for the search is $O(d)$, where $d$ is the distance between the searcher
and the nearest work, the network bandwidth consumed is $O(d^2 \log d)$. This is not surprising,
due to the use of a gathering process to collect threads from the entire sub-tree rooted at
$N^l_{j_0,j_1}$.


Shared-Memory

The shared-memory analysis for the two-dimensional case follows the message-passing
analysis in the same way that it does for the one-dimensional case. The resulting cost goes up
by a factor of log d for the matching algorithm.

5.2.2 Presence Bit Update

In this section, we derive lower bounds on the latency and bandwidth using the
message-passing communication style. We show that the X-Tree-guided update algorithm has a
worst-case cost proportional to the network diameter, but an expected cost independent of
machine size, for certain load conditions. We argue that the worst-case cost only goes up by a
factor of two when the shared-memory communication style is used. The derivation of this fact
follows that given for the one-dimensional case.

Message-Passing


The presence bit update analysis for two dimensions is similar to that for one dimension.
First, the bad news: the worst-case tree update critical-path cost can be as bad as O(k).
We now show why this is the case:

Again, we find that the worst-case tree update cost is proportional to the network diameter.

Now, the good news: again, the expected tree update cost is much lower on a machine
whose workload is both balanced and sufficiently high. These load conditions are expressed
by any distribution of thread queue lengths such that the probability of non-zero queue
length is at least $\lambda$, for some $\lambda$: $\mathrm{Prob}(\mathit{qlen} > 0) \ge \lambda$.

This result shows again that although the worst-case tree update cost is proportional to the
network diameter, on a machine whose workload is both balanced and sufficiently high, the 


expected cost is O (x). which does not depend on the network diameter. 

The network bandwidth requirement for the tree update process is somewhat higher:
not only does it include child-parent communication costs, but it also includes costs incurred
when updating neighbors’ cached presence bits.
Peal 
BU) 

The worst-case bandwidth requirement for the tree update process is again O(k).
The expected bandwidth requirement is significantly lower:

E[B(U)] = E[U] + E[B(neighbor-cache update)]
1-2 ae ee ee 

The expected bandwidth requirement is again O (sr).

Shared-Memory

The shared-memory analysis for the two-dimensional case follows the message-passing
analysis in the same way that it does for the one-dimensional case. The resulting cost goes up
only by a constant factor.

5.3 n-Dimensional Case


This section generalizes the analyses presented in the previous two sections. Here, we look
at the behavior of the X-Tree algorithm on an n-dimensional mesh, for any n. We only
examine the case of message-passing: the shared-memory costs go up by a factor of log d for
search and a constant for update.


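Throughout this section, the smallest number of hops $d$ between two nodes on processors
$P_{i_0,i_1,\ldots,i_{n-1}}$ and $P_{j_0,j_1,\ldots,j_{n-1}}$ is simply
$|j_0 - i_0| + |j_1 - i_1| + \ldots + |j_{n-1} - i_{n-1}|$:
\[ d = D\left(N^0_{i_0,i_1,\ldots,i_{n-1}}, N^0_{j_0,j_1,\ldots,j_{n-1}}\right) = |j_0 - i_0| + |j_1 - i_1| + \ldots + |j_{n-1} - i_{n-1}| \]
This is the lowest possible cost for communicating between $P_{i_0,i_1,\ldots,i_{n-1}}$ and
$P_{j_0,j_1,\ldots,j_{n-1}}$.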


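When following the shortest path through the tree between $N^0_{i_0,i_1,\ldots,i_{n-1}}$ and
$N^0_{j_0,j_1,\ldots,j_{n-1}}$, we have to ascend $l$ levels:
\[ \max\left(\lfloor \log_2 |j_0 - i_0| \rfloor, \ldots, \lfloor \log_2 |j_{n-1} - i_{n-1}| \rfloor\right) \le l \le \max\left(\lceil \log_2 |j_0 - i_0| \rceil, \ldots, \lceil \log_2 |j_{n-1} - i_{n-1}| \rceil\right) \]
\[ \left\lfloor \max\left(\log_2 |j_0 - i_0|, \ldots, \log_2 |j_{n-1} - i_{n-1}|\right) \right\rfloor \le l \le \left\lceil \max\left(\log_2 |j_0 - i_0|, \ldots, \log_2 |j_{n-1} - i_{n-1}|\right) \right\rceil \]
\[ \lfloor \log_2 (d/n) \rfloor \le l \le \lceil \log_2 d \rceil \]
The exact value for $l$ depends on how the nodes are aligned with respect to the tree.

Finally, $L$ is the height of the tree:
\[ \lfloor \log_2 k \rfloor \le L \le \lceil \log_2 k \rceil \]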

5.3.1 Search

In this section, we show that the cost of searching the X-Tree for nearby work is
4n-competitive with the optimal adversary. In a k-ary n-dimensional mesh, a path from a
node $N^0_{i_0,i_1,\ldots,i_{n-1}}$ to node $N^0_{j_0,j_1,\ldots,j_{n-1}}$ is established by
ascending $l$ levels in the tree from node $N^0_{i_0,i_1,\ldots,i_{n-1}}$ to node
$N^l_{i_0,i_1,\ldots,i_{n-1}}$, crossing over to node $N^l_{j_0,j_1,\ldots,j_{n-1}}$ and descending
back down to node $N^0_{j_0,j_1,\ldots,j_{n-1}}$. The following analysis demonstrates the
competitive factor of 4n:
\[
\begin{aligned}
&+ D\left(N^l_{i_0,i_1,\ldots,i_{n-1}},\, N^l_{j_0,j_1,\ldots,j_{n-1}}\right) \\
&\le \left[1 + \sum_{x=1}^{l-1} n\,2^{x-1}\right] + n\,2^{l} + \left[1 + \sum_{x=1}^{l-1} n\,2^{x-1}\right] \\
&\le 4nd
\end{aligned}
\]


Therefore, the traversal cost for the X-Tree is no more than a factor of 4n worse than direct
access through the mesh.

As is shown for the one- and two-dimensional cases, both for the X-Tree algorithm and
for the optimal adversary, a search costs twice as much as a simple message send from
source to destination. Therefore, the competitive factor of 4n holds for a search as well as
for a one-way message send.

We now derive the worst-case network bandwidth requirement for a search that begins
at $N^0_{i_0,i_1,\ldots,i_{n-1}}$ when the closest work is at $N^0_{j_0,j_1,\ldots,j_{n-1}}$:
feat 
B (NR atweuin aa? No duveadn1) S ( +Sin @=)) 
v=], 
nf) 
a 
+ [14d [ex] et) 
x2=0 
<n ime Soha Cai 
< nfl+3) [ary] 
<_n[B + flog, dl] [(2")hoe 4) 
≤ n [4 + log_2 d] d^n


Although the running time for the search is $O(nd)$, where $d$ is the distance between the
searcher and the nearest work, the network bandwidth consumed is $O(n d^n \log d)$. Again,
this is expected behavior, due to the use of a gathering process to collect threads from the
entire sub-tree rooted at $N^l_{j_0,j_1,\ldots,j_{n-1}}$.

5.3.2 Presence Bit Update

The presence bit update analysis for n dimensions is similar to that for one and two
dimensions. First, the bad news: the worst-case tree update critical-path cost can be as bad as
O(nk). We now show why this is the case.


\[ U \le n + \sum_{i=0}^{L-2} n\,2^{i} \le n\,2^{\lceil \log_2 k \rceil - 1} \le nk \]


Yet again, we find that the worst-case tree update cost is proportional to the network
diameter.

As usual, the expected tree update cost is much lower on a machine whose workload is
both balanced and sufficiently high. As before, these load conditions are expressed by any
distribution of thread queue lengths such that the probability of non-zero queue length is
at least $\lambda$, for some $\lambda$: $\mathrm{Prob}(\mathit{qlen} > 0) \ge \lambda$.

This result shows that although the worst-case tree update cost is proportional to the
network diameter, on a machine whose workload is both balanced and sufficiently high, the 


expected cost is O (4). which does not depend on the network diameter. 

The network bandwidth requirement for the tree update process is somewhat higher; not
only does it include child-parent communication costs, but it also includes costs incurred
when updating neighbors’ cached presence bits:
ies 
BU) < k+ 5° (3"-1)2 
i=0 
< k+ (3-1) (2!e241) 


< k+(3"—1)k 


The worst-case bandwidth requirement for the tree update process is $O(3^n k)$.
The expected bandwidth requirement is significantly lower on a machine whose workload is
balanced and sufficiently high:

E[B(U)] = E[U] + E[B(neighbor-cache update)]

The expected bandwidth requirement is O (3).

Chapter 6


Experimental Method 


This chapter describes experiments carried out in the course of this research. This descrip- 
tion covers three main areas: simulation environments, thread management algorithms and 
applications. 

Our tests employed two simulation environments. The first, named NWO, is an accurate 
cycle-by-cycle simulator of the Alewife machine. NWO is the primary vehicle for Alewife 
system software development. The second, named PISCES, is a faster but less accurate 
simulator built specifically for this research. PISCES was the main data-gathering apparatus 
for this thesis, allowing us to simulate systems of up to 16384 processors. We extracted 
the parameters used to drive PISCES from simulations run using NWO, as described in 
Section 6.1. 

A number of thread management algorithms were implemented to run on PISCES. These 
include the X-Tree algorithm used by XTM, two other combining-tree algorithms (TTM 
and XTM-C), two diffusion-based thread managers (Diff-1 and Diff-2), two round-robin 
thread managers (RR-1 and RR-2) and four idealized thread managers (Free-Ideal, P- 
Ideal, C-Ideal-1 and C-Ideal-2). All of these thread managers share a single queue 
management discipline, which tends to increase thread locality and avoid memory over- 
flow problems, while encouraging a uniform spread of threads around the machine. See 
Section 6.4 for the details of the various thread managers. 

Five applications were run on PISCES under the various candidate thread managers: a
two-dimensional integrator that employs an adaptive quadrature algorithm (AQ), a
branch-and-bound solver for the traveling salesman problem (TSP), doubly recursive Fibonacci
(FIB), a block matrix multiply (MATMUL), and a synthetic unbalanced application (UNBAL)
which tests system behavior in the face of an initially poor load distribution. Details
of the various applications are given in Section 6.5.

Figure 6-1: The Alewife Machine.


6.1 NWO: The Alewife Simulator 


Alewife is an experimental multiprocessor being developed at MIT (see Figure 6-1). Alewife 
is primarily a shared-memory machine, containing coherent caches and a single shared 
address space for all processors. Alewife also supports efficient interprocessor messages, 
allowing programs which use a message-passing communication style to execute as efficiently 
as shared-memory programs. 

At the time of this writing, hardware development for the Alewife machine is nearing
completion. While actual hardware is not yet available, a detailed cycle-by-cycle simulator
for Alewife is the primary vehicle for system software development. This simulator, dubbed
NWO, performs a cycle-by-cycle simulation of the processors, the memory system and the
interprocessor communications network, all of which will eventually be present in the actual
Alewife hardware (see Figure 6-2). NWO is faithful enough to the Alewife hardware that
it has exposed many Alewife hardware bugs during the design phase.

Figure 6-2: NWO Simulator Organization.

Figure 6-3: PISCES Multithreader Organization.


NWO’s primary drawback is its slow execution speed. NWO provides accuracy at the
cost of relatively low performance: on a SPARC-10, simulations run at about 2000 clock
cycles per second. This means that a typical 64-processor simulation runs approximately two
million times slower than it would on the actual hardware. For this reason, it is impossible
to run programs of any appreciable size on NWO. Therefore, for the purposes of this thesis,
NWO was primarily employed as a statistics-gathering tool. Parameters from NWO runs
were used to drive PISCES, a faster, higher-level simulator, described in the next section.

begin
    (suspend <x>)
    (suspend <y>)
    (suspend <z>)
end

Figure 6-4: A Typical PISCES Thread.

Figure 6-5: PISCES Thread Execution Order.

6.2 The PISCES Multiprocessor Simulator


The PISCES multiprocessor simulator is a general, low-overhead multiprocessor simulation 
system, consisting of a simple multithreading system and a machine model. The entire 
system is written in T [23], a dialect of LISP. 

The multithreader diagrammed in Figure 6-3 supports multiple independent threads 
of computation executing in a common namespace. Each thread is a T program with its 
own execution stack. A typical thread consists of a series of blocks of code, separated 
by expressions of the form (suspend <t>). Each suspend expression informs the system 
that the associated code block requires t cycles to run (see Figure 6-4). The suspend 
mechanism is the only way for a thread to move forward in time; all code run between 
suspend expressions is atomic with respect to the rest of the simulation. 

PISCES threads are sorted into a time-ordered queue. The multithreader takes the first
thread from the queue, runs it until it encounters the next (suspend <t>) expression, and
re-enqueues it to execute t cycles later (see Figure 6-5). New threads are created by calling
(make-ready <proc> <t>), which creates a new thread to run t cycles from the current
simulated time, and which will run procedure proc when it first wakes up. When proc
finishes, the thread terminates.

Figure 6-6: PISCES Alewife Simulation.
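
The time-ordered queue described above can be sketched in a few lines. This is a minimal illustration in Python rather than T, under the assumption that a generator's `yield t` stands in for `(suspend <t>)`; the class and method names are hypothetical and are not taken from PISCES.

```python
import heapq
from itertools import count


class Multithreader:
    """Toy time-ordered thread queue; generators stand in for T threads."""

    def __init__(self):
        self.now = 0
        self._queue = []      # entries: (wake_time, sequence, generator)
        self._seq = count()   # tie-breaker so generators are never compared

    def make_ready(self, proc, delay=0):
        """Create a thread that first runs `delay` cycles from now."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), proc()))

    def run(self):
        while self._queue:
            wake, _, thread = heapq.heappop(self._queue)
            self.now = wake
            try:
                cycles = next(thread)   # run until the next suspend
            except StopIteration:
                continue                # procedure finished: thread terminates
            heapq.heappush(self._queue, (self.now + cycles, next(self._seq), thread))


log = []
sim = Multithreader()


def worker(name, cost):
    def proc():
        for step in range(2):
            log.append((sim.now, name, step))
            yield cost                  # plays the role of (suspend <cost>)
    return proc


sim.make_ready(worker("a", 5))
sim.make_ready(worker("b", 3), delay=1)
sim.run()
print(log)
```

Each code block between two yields runs without interruption, which mirrors the statement that code between suspend expressions is atomic with respect to the rest of the simulation.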

The PISCES Alewife machine model diagrammed in Figure 6-6 is built on top of the
PISCES multithreader. A simulated processor can be in one of three states:

Thread Manager: A processor begins its life running the thread manager. It continues
in this state until it finds an application thread to run. At this point, the thread
manager associated with this processor is put aside and the processor begins to run
the application thread.

Application: The threads that make up the application being run on the simulated machine
are implemented as PISCES threads. An application thread continues to run until it
either terminates or suspends on a synchronization datatype.

Interprocessor Message: Messages that are sent between simulated processors are also
implemented as PISCES threads. When a simulated processor sends a message to
another processor, it creates a new PISCES thread to be run on the other processor
c cycles in the future, where c is the number of cycles that it takes to send a message
between the two processors. At that time, the destination processor is interrupted,
and the message thread is executed. The interrupt mechanism is described below.


PISCES threads run until they release control through suspend expressions. Furthermore,
application threads can be suspended by performing operations on certain synchronization
datatypes, described below. Since this is a uniprocessor simulation, nothing can
change the state of the simulated machine while a given thread is running (between suspend
or synchronization operations). Therefore, users of the PISCES multithreading package are
encouraged to keep blocks between suspend calls small, in order to improve the accuracy
of the simulation.

Parameter       Description                  Default Value
                Time to Suspend a Thread     99 cycles
Thd-Load-Ovh    Time to Load a New Thread    29 cycles

Table 6.1: Timing Parameters: Obtained from NWO simulations.


A processor can be interrupted at any time. When an interrupt message is received,
the associated interrupt thread is executed at the next break. This could result in poor
timing behavior in the presence of long suspend operations: if a thread is suspended for
a long period of time when an interrupt is received, the interrupt will not be processed
until the thread resumes execution and is then suspended again. For this reason, all “long”
suspend operations are executed as a series of shorter suspend operations. Currently, the
suspend quantum is ten cycles: this seems to give a fair balance between performance of
the simulation and accuracy of timing behavior.
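
As an illustration of how a long suspend might be broken up, the sketch below splits a cycle count into quantum-sized pieces. It assumes a final shorter piece is used for any remainder, which the text does not specify; the function name is hypothetical.

```python
def split_suspend(cycles, quantum=10):
    """Break one long suspend into quantum-sized pieces plus a remainder.

    Assumption: a leftover shorter than the quantum becomes its own
    final piece, so the total simulated time is unchanged.
    """
    full, rest = divmod(cycles, quantum)
    return [quantum] * full + ([rest] if rest else [])


pieces = split_suspend(35)
print(pieces, sum(pieces))
```

With the default quantum of ten cycles, an interrupt arriving during a 35-cycle suspend would wait for at most one piece rather than for the whole suspend.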


6.2.1 Timing Parameters

The PISCES Alewife simulation requires a number of timing parameters to be set. These
parameters describe the timing behavior of the machine being simulated. For the purposes
of this thesis, these parameters were obtained through measurements of NWO simulations.
They are summarized in Table 6.1.

6.2.2 Inaccuracies


The PISCES Alewife simulation executes from one to two orders of magnitude faster than 
NWO. This gain in performance has a cost: PISCES simulations contain a number of 
inaccuracies not found in NWO simulations. 

First, cache behavior is ignored unless specifically modeled in the application (see de- 
scription of Cached MATMUL given below). All timing figures given in Table 6.1 assume 
cache hits, except in those sections of the code where the cache is guaranteed to miss, in 
which case local cache miss timing is assumed. Furthermore, all data accesses are free, 
except those whose cost is explicitly modeled through suspend expressions. For most of 
the applications described below, all data accesses are to the execution stack, which is in 
local memory and usually resident in the cache, thus minimizing the effect of this inaccurate 
behavior. 

Second, the network model assumes no contention. The costs associated with a message
send include fixed source and destination costs and a variable cost depending on the size of
the message, the distance between source and destination, and the network speed:

MsgCost = MsgSendOvh + MsgRcvOvh + [MsgSize + Dist(Src → Dst)] × NetSpeed
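
The contention-free cost model translates directly into a one-line function. The sketch below is illustrative only: the parameter names follow the formula, but the numeric values in the usage line are made up and are not the defaults from Table 6.1.

```python
def msg_cost(msg_size, dist, send_ovh, rcv_ovh, net_speed):
    """Contention-free message cost in cycles.

    send_ovh and rcv_ovh are the fixed source and destination costs;
    the variable part scales with message size plus hop distance.
    """
    return send_ovh + rcv_ovh + (msg_size + dist) * net_speed


# Hypothetical values chosen only to exercise the formula.
cost = msg_cost(msg_size=4, dist=6, send_ovh=10, rcv_ovh=20, net_speed=2)
print(cost)
```

Because the model has no contention term, the cost of a message depends only on its own size and distance, never on other traffic in the network.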


Third, as discussed above, the behavior of suspend calls affects the timing behavior
of the entire simulation. Blocks of code executed between suspend calls appear atomic to
the rest of the simulation. Furthermore, any inaccuracies in the argument to the suspend
associated with a block show up as inaccuracies in the running time of that thread. Finally,
interrupts can only occur at suspend quantum boundaries. This means that if a message
shows up at a processor at time t, it might not be executed until time t + q, where q is
the suspend quantum. For all data given in Chapter 7, this quantum was set to ten cycles,
which is small enough to be insignificant.

6.2.3 Synchronization Datatypes


The PISCES Alewife simulation system supports a number of synchronization datatypes,
including j-structures, l-structures, and placeholders. These make up a subset of
the datatypes provided by Mul-T [12].

J-structures are arrays whose elements provide write-once semantics. A read operation
on a j-structure element is deferred until a write to that element takes place. A
processor issuing such a read is suspended on a queue associated with the element. A
write operation to an unwritten j-structure element writes a value and then frees all
threads suspended while waiting for the write to take place. A write operation to a written
j-structure element causes an error.

L-structures are arrays of protected data. A read performs a lock operation on the
addressed element, returning the associated piece of data when it succeeds. A thread
that attempts to read an l-structure cell that is already locked is suspended on a queue
associated with that cell. A write performs an unlock operation on the addressed element,
reenabling any threads suspended on that element through failed reads. A write to an
unlocked l-structure element causes an error.

Placeholders are used for communication between parent and child in a future call.
Conceptually, a placeholder consists of a value, a queue and a flag, which signifies
whether the data item in the value slot is valid. When a placeholder is first created,
the flag is set to empty. To change the flag to full, a determine operation must be
performed on the placeholder. When a future call occurs, a child thread is created,
and an associated placeholder is returned to the parent. If the parent tries to read the
value associated with the placeholder before the determine operation has taken place,
the parent is suspended on the placeholder’s queue. When the child thread terminates,
it determines the value of the placeholder, reenabling any associated suspended threads.
The semantics of a placeholder are very similar to those of a single j-structure cell.
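
The value, queue and flag of a placeholder can be modeled with a small class. This is a sketch in Python, not the Mul-T or PISCES implementation: suspended threads are represented simply by their names, and raising an error on a second determine is an assumption borrowed from the j-structure rule for writes to written elements.

```python
class Placeholder:
    """Write-once slot: a value, a queue of waiters, and a full/empty flag."""

    def __init__(self):
        self.full = False
        self.value = None
        self.queue = []            # names of threads suspended on this slot

    def read(self, thread):
        """Return the value if determined; otherwise suspend the reader."""
        if self.full:
            return self.value
        self.queue.append(thread)  # reader waits for determine()
        return None

    def determine(self, value):
        """Fill the slot once and return the threads to re-enable."""
        if self.full:
            # Assumption: mirrors the j-structure rule that a second write is an error.
            raise RuntimeError("placeholder already determined")
        self.value, self.full = value, True
        woken, self.queue = self.queue, []
        return woken


slot = Placeholder()
first_read = slot.read("parent")   # parent arrives before the child finishes
woken = slot.determine(42)         # child terminates and determines the value
second_read = slot.read("parent")
print(first_read, woken, second_read)
```

A j-structure could then be approximated as an array of such slots, one per element, which matches the remark that the two datatypes have very similar semantics.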


6.3 Finding the Optimal Schedule 


When evaluating the various candidate thread management algorithms, it is desirable to 
have a standard to compare the candidates to. The “best” schedule would be ideal for this 
purpose, but, as discussed in Chapter 2, the general thread management problem is NP- 
hard, even when the entire task graph is known. For examples of the size we’re interested 
in, this effectively makes the optimal schedule impossible to obtain. 


However, we can find a near-optimal schedule in most cases. The approach we’ve taken
is three-pronged:


1. For all applications, we know the single-processor running times and we can calculate
the critical path lengths. An “optimal” T-vs.-p curve is derived from these two
numbers as follows:

\[ T_p = \max\left(\frac{T_1}{p},\, T_{crit}\right) \]

where $T_1$ is the running time on one processor, $p$ is the number of processors and
$T_{crit}$ is the critical path length. This yields a relatively tight lower bound on the
best-case running time, and shows up in the data given in Chapter 7 under the label Ideal.


2. For certain applications, a near-optimal static schedule can be derived from the regular
structure of the application. This approach gives a relatively tight upper bound on
best-case running times when applied to the UNBAL and MATMUL applications
described below.

3. For applications for which a good static schedule is not practical to obtain, we employ
a more empirical approach. A number of the thread managers described below are not
physically realizable. Such idealized thread managers can assume, for example, that
the state of every processor’s thread queue is known instantaneously on every other
processor. Since we are running simulations, implementing such unrealistic thread
managers is straightforward.

Together, these three approaches yield an estimate of the running time that could be
achieved by an optimal schedule.
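
The Ideal bound from the first approach is simple enough to compute directly. The sketch below evaluates the larger of the single-processor time divided by p and the critical path length; the function name and the numbers in the usage lines are illustrative and do not come from the experiments.

```python
def ideal_time(t1, t_crit, p):
    """Lower bound on running time with p processors.

    t1 is the single-processor running time and t_crit is the
    critical path length, both in cycles.
    """
    return max(t1 / p, t_crit)


# Hypothetical application: 1000 cycles of work, 50-cycle critical path.
few = ideal_time(t1=1000, t_crit=50, p=4)
many = ideal_time(t1=1000, t_crit=50, p=64)
print(few, many)
```

For small p the bound is set by the available work, and for large p it flattens out at the critical path length.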


6.4 Thread Management Algorithms 


A number of candidate thread management algorithms were tested for comparison against
XTM. These algorithms can be split into two groups: realizable and unrealizable. The
realizable algorithms are those that can be implemented on a real multiprocessor; the
unrealizable algorithms make unrealistic assumptions that preclude their use in an actual
multiprocessor. The unrealizable algorithms make it possible to gauge the effect of certain
thread management costs by eliminating those costs, yielding corresponding performance
figures.


6.4.1 Unrealizable Algorithms 


Calculated: The Ideal performance figures given in Chapter 7 are not taken from any
thread manager at all; they are calculated as described above in Section 6.3.

Free Idealized: Free-Ideal employs a single global thread queue. Push and pop operations
to this queue are free and can proceed concurrently. This algorithm is implemented
to get a near-lower-bound on application running times, discounting communication
and queue contention costs.

Idealized Producer-Driven: P-Ideal gives every processor instantaneous knowledge of
the state of the thread queue on every other processor. When a new thread is created
on a given processor, it is sent to the processor with the least amount of work on its
queue, where the cost of moving the thread (which increases with distance) is added
to the processor’s perceived workload, so as to make distant processors less attractive
than nearby processors.

Idealized Consumer-Driven 1 (steal-one): We look at two variants of an idealized
consumer-driven thread manager: C-Ideal-1 and C-Ideal-2. Both versions allow every
processor to have instantaneous knowledge of the state of the thread queue on every
other processor. This information is used by idle, consumer processors, instead of
busy, producer processors. An idle processor steals one thread from the nearest
processor that has work on its queue by sending a steal message to that processor. If
more than one processor at a given distance has work, the processor with the most
work on its queue is selected.

Idealized Consumer-Driven 2 (steal-half): C-Ideal-2 has exactly the same behavior as
C-Ideal-1, save for one difference concerning the number of threads moved during a
steal operation. C-Ideal-1 only moves one thread at a time, while C-Ideal-2 moves
half of the threads from the producer processor’s queue to the consumer processor’s
queue.
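
The victim-selection rule shared by C-Ideal-1 and C-Ideal-2 can be sketched as follows. This is a hedged illustration: processors are identified by hypothetical (x, y) mesh coordinates, distance is taken to be the mesh hop count, and rounding "half" down (but never below one thread) is an assumption the text does not spell out.

```python
def pick_victim(thief, queue_len):
    """Nearest processor with work; ties go to the longest queue.

    thief is an (x, y) coordinate; queue_len maps (x, y) -> queue length.
    Returns None when no other processor has work.
    """
    candidates = [
        (abs(x - thief[0]) + abs(y - thief[1]), -length, (x, y))
        for (x, y), length in queue_len.items()
        if length > 0 and (x, y) != thief
    ]
    return min(candidates)[2] if candidates else None


def threads_to_move(victim_len, steal_half):
    """One thread for steal-one; half the queue (assumed rounded down) for steal-half."""
    return max(1, victim_len // 2) if steal_half else 1


queues = {(0, 0): 0, (0, 1): 2, (1, 0): 5, (3, 3): 9}
victim = pick_victim((0, 0), queues)
print(victim, threads_to_move(queues[victim], False), threads_to_move(queues[victim], True))
```

In this example two processors are one hop away, so the tie is broken in favor of the one with five queued threads, even though a more distant processor holds more work.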

6.4.2 Realizable Algorithms

Static: Stat is the simplest of the real thread managers. Every thread created is sent
to a specific processor on the machine, as directed by the code itself. This is useful
for examining static schedules, produced either by some compiler or explicitly by the
user. A processor only runs those threads that are assigned to its queue.


Round-Robin-1 (steal-one): The round-robin thread manager comes in two forms, RR-1
and RR-2, both of which use pseudo-random drafting strategies. A processor scans
the queues of all other processors in a near-to-far order obtained by successively XORing
its PID with each number between 0 and p−1, where p is the number of processors
in the machine. In this manner, each processor scans the machine in its own unique
order. This order proceeds from near to far due to the mapping from PID to mesh
location used by the Alewife processor [15]. In RR-1, when a non-empty queue is
found, one thread is moved from that queue to the processor making the request.

Round-Robin-2 (steal-half): As was the case for C-Ideal-1 and C-Ideal-2, the only
difference between RR-1 and RR-2 concerns the number of threads that are moved.
While RR-1 moves only one thread at a time, RR-2 moves half of the threads from
producer processor to consumer processor.


Diffusion-1: Diffusion scheduling is suggested in [9]. In both Diff-1 and Diff-2, every
h cycles, a diffusion step is executed on each processor (the default value for h is
1000, which seems to give the best tradeoff between thread-manager overhead and
effectiveness). On each diffusion step, a processor compares its queue length with those
of its four nearest neighbors. For each neighbor, if the local processor’s queue is longer
than the neighbor’s queue, r threads are sent to the neighbor, where
$r = \frac{(l_0 - l_n) + 3}{6}$, $l_0$ is the length of the local processor’s queue and $l_n$
is the length of the neighboring processor’s queue.

The particular choice of the “6” in the expression $r = \frac{(l_0 - l_n) + 3}{6}$ comes out
of a Jacobi Over-Relaxation [3] with relaxation parameter $\gamma$ set to 2. The 3 in the
numerator is present to ensure roundoff stability: for any number greater than three, a single
thread can bounce back and forth between processors.

Diffusion-2: Diff-2 only differs from Diff-1 in that the expression to determine the
amount to diffuse from a processor to its neighbor is $r = \frac{(l_0 - l_n) + 5}{6}$, not
$r = \frac{(l_0 - l_n) + 3}{6}$ as for Diff-1. The difference in performance can be dramatic,
because Diff-2 spreads work more uniformly around the machine. This increased performance
comes at the price of roundoff stability: Diff-2 is not completely stable in that on alternate
diffusion steps, a single thread can bounce back and forth between two processors.


Single-Bit X-Tree: XTM is described in Chapter 4. One bit of presence information is
maintained at each node in the tree, signifying whether there is any work available to
be stolen in the subtree headed by that node.

Simple Tree: TTM is a simple tree-based thread manager similar to XTM, except that
the tree-structure employed has no nearest-neighbor links. This lowers the cost of
updating the tree, since presence bit information doesn’t have to be sent to neighbors
when the state of the presence bit changes. However, there is some loss of locality
since tree nodes that are next to one another can be topologically distant in the tree.

Multi-Bit X-Tree: XTM-C is a multi-bit X-Tree algorithm very similar to XTM, except
that more than one bit of presence information is maintained at each node in the
tree. In order to limit the cost of updating the “weights” maintained at the tree
nodes, a node informs its parent only of weight changes that cross one of a set of
exponentially-spaced thresholds. In this manner, small changes in small weights are
transmitted frequently while small changes in large weights, which don’t matter as
much, are transmitted less frequently. Furthermore, hysteresis is introduced into the
system by choosing a different set of boundaries for positive and negative changes, in
order to avoid repeated updates resulting from multiple small changes back and forth
across a single boundary.

The node weights add a degree of conservatism to the algorithm employed by XTM.
When a search process encounters a node with work, it doesn’t always balance between
the empty node and the non-empty node. Instead, a searching node balances with a
neighbor only if the amount of work brought back justifies the cost of bringing the
work back.
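
The threshold scheme with hysteresis can be sketched as follows. The specific boundaries are assumptions chosen only for illustration (upward boundaries at powers of two, downward boundaries at three quarters of the corresponding upward boundary); the text does not give the actual values, and `WeightReporter` is a hypothetical name.

```python
def up_boundary(level):
    """Weight at which a node rises from `level` to `level + 1` (assumed: 2**level)."""
    return 2 ** level


def down_boundary(level):
    """Weight below which a node falls from `level` to `level - 1`.

    Assumed to be 3/4 of the boundary crossed on the way up, so the rising and
    falling boundaries differ, which is the source of the hysteresis.
    """
    return max(1, (3 * 2 ** (level - 1)) // 4)


class WeightReporter:
    """Tracks one tree node's weight and counts updates sent to its parent."""

    def __init__(self):
        self.level = 0
        self.updates_sent = 0

    def observe(self, weight):
        """Record a new weight; return True if the parent must be informed."""
        old_level = self.level
        while weight >= up_boundary(self.level):
            self.level += 1
        while self.level > 0 and weight < down_boundary(self.level):
            self.level -= 1
        if self.level != old_level:
            self.updates_sent += 1
            return True
        return False


if __name__ == "__main__":
    node = WeightReporter()
    for w in [8, 7, 8, 7, 8]:        # small changes in a large weight
        node.observe(w)
    print(node.updates_sent)
    small = WeightReporter()
    for w in [1, 0, 1, 0]:           # small changes in a small weight
        small.observe(w)
    print(small.updates_sent)
```

With these boundaries a weight that wobbles between 7 and 8 causes a single update, while a weight that alternates between 0 and 1 is reported on every change.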

Figure 6-7: Thread Queue Management.


6.4.3 Queue Management 


Certain functional aspects are shared by all of the thread management algorithms described
above. In particular, they all share the same queue discipline, as pictured in Figure 6-7.
When a processor creates a new thread to be put on its own local queue, it puts the thread
on the head of the queue. When one or more threads are moved from one processor to
another, they are taken from the tail of one queue and put on the tail of the other. Threads
taken from the local queue for execution are always taken from the head of the thread
queue. This causes threads to be executed in depth-first order locally, thereby minimizing
memory consumption, while moving parallel threads around the machine in a more
breadth-first fashion, thereby spreading work around the machine efficiently (as suggested in [8]).
Furthermore, since every processor has its own thread queue, the depth-first order tends
to cause threads created on a processor to remain on that processor, reducing demands on
the communications network.
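
A minimal sketch of this queue discipline, using a double-ended queue whose left end is the head and whose right end is the tail. The class and method names are hypothetical, and the order in which several migrated threads arrive on the destination tail is an assumption.

```python
from collections import deque


class ThreadQueue:
    """Per-processor thread queue: left end is the head, right end is the tail."""

    def __init__(self):
        self._threads = deque()

    def __len__(self):
        return len(self._threads)

    def create_local(self, thread):
        """A newly created local thread goes on the head of the queue."""
        self._threads.appendleft(thread)

    def next_to_run(self):
        """Threads chosen for local execution come from the head (depth-first)."""
        return self._threads.popleft() if self._threads else None

    def remove_for_migration(self, count):
        """Threads leaving this processor are taken from the tail (oldest first)."""
        count = min(count, len(self._threads))
        return [self._threads.pop() for _ in range(count)]

    def accept_migrated(self, threads):
        """Threads arriving from another processor are put on the tail."""
        self._threads.extend(threads)


if __name__ == "__main__":
    source, destination = ThreadQueue(), ThreadQueue()
    for name in ["root", "child", "grandchild"]:
        source.create_local(name)
    print(source.next_to_run())                  # most recently created thread
    destination.accept_migrated(source.remove_for_migration(1))
    print(destination.next_to_run())             # oldest thread migrated
```

The most recently created thread runs locally, while the oldest thread, which tends to sit nearer the root of the task tree, is the one that migrates.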


6.5 Applications 


The candidate thread managers described above were tested on a number of example
applications. These applications were chosen to fill the following requirements:

1. Any candidate application has to run for a short enough period of time so as to make
PISCES simulations practical.




2. Any application chosen should be “interesting.” An application is deemed to be
interesting if for some machine size and problem size, near-linear speedup is possible when
good thread management decisions are made. At the same time, “bad” thread
management decisions should yield poor speedup for the given machine size and problem
size.

3. The collection of applications chosen has to cover a range of behaviors deemed to be
“typical” of dynamic applications.

The actual code for these applications can be found in Appendix B.
We first consider applications that are relatively fine-grained. Their task graphs are
tree-structured; virtually all communication takes place directly between parents and children
in the task tree:


Numerical Integration AQ makes use of an Adaptive Quadrature algorithm for
integrating a function of two variables. This algorithm, given in [22], has a task tree
whose shape is determined by the function being integrated. The particular function
integrated is $x^4 y^4$, over the square bounded by (0.0, 0.0) and (2.0, 2.0). Problem size is
determined by the accuracy threshold: higher accuracy requires more work to achieve
convergence.
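
To make the link between the accuracy threshold and the shape of the task tree concrete, here is a small serial sketch of a two-dimensional adaptive quadrature. It is not the algorithm of [22]: the midpoint rule, the four-way subdivision, the error test and the way the tolerance is split among quadrants are all assumptions made for illustration, and the integrand follows the reading $x^4 y^4$.

```python
def midpoint_estimate(f, x0, y0, x1, y1):
    """Midpoint-rule estimate of the integral of f over one rectangle."""
    return f((x0 + x1) / 2.0, (y0 + y1) / 2.0) * (x1 - x0) * (y1 - y0)


def adaptive_quadrature(f, x0, y0, x1, y1, tolerance):
    """Return (integral estimate, number of task-tree nodes visited)."""
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]
    coarse = midpoint_estimate(f, x0, y0, x1, y1)
    fine = sum(midpoint_estimate(f, *q) for q in quadrants)
    if abs(fine - coarse) <= tolerance:
        return fine, 1                      # leaf of the task tree
    total, nodes = 0.0, 1
    for q in quadrants:
        # In a parallel version, each recursive call would be a new thread.
        value, count = adaptive_quadrature(f, *q, tolerance / 4.0)
        total += value
        nodes += count
    return total, nodes


def integrand(x, y):
    return x ** 4 * y ** 4


if __name__ == "__main__":
    for tol in (0.5, 0.01, 0.001):
        value, nodes = adaptive_quadrature(integrand, 0.0, 0.0, 2.0, 2.0, tol)
        print(tol, value, nodes)
```

Tightening the tolerance makes the recursion go deeper where the integrand changes fastest, so the number of nodes in the task tree grows with the requested accuracy.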


Traveling Salesman TSP [18] finds the shortest path between randomly placed cities on
a two-dimensional surface. In this case, problem size is determined by the number of
cities scanned. The search space is pruned using a simple parallel branch-and-bound
scheme, where each new “best” path length is broadcast around the machine. For
problem and machine sizes tested, the detrimental effects of such broadcasts were
small.

This application is unique in that the total work depends on the order in which the
search tree is scanned. For the largest problem size we explored (11 cities), differences
in scanning order could result in up to a factor of four in total work. However, the
actual differences in total work between different runs on various machine sizes using
various thread management algorithms amounted to less than a factor of two.

Figure 6-8: Coarse-Grained MATMUL Partitioning: A 16-by-16 matrix times a second 16-by-16
matrix gives a third 16-by-16 matrix. Each matrix is partitioned up into 16 4-element by 4-element
blocks. A single thread multiplies the entire top row of sub-blocks in the first matrix times the entire
left-hand column of sub-blocks in the second matrix to calculate sub-block 0,0 of the destination
matrix.


The Ideal figures for this application were calculated assuming that the minimal
search tree was scanned.

Fibonacci FIB(n) calculates the nth Fibonacci number in an inefficient, doubly-recursive
manner. In this case, n determines the problem size.
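
The doubly-recursive structure can be sketched as below. Counting every call as one thread and using n < 2 as the base case are assumptions made for illustration; the actual application code is the one in Appendix B.

```python
def fib_with_task_count(n):
    """Return (fib(n), number of calls), treating every call as one thread."""
    if n < 2:
        return n, 1
    a, calls_a = fib_with_task_count(n - 1)   # would be spawned as a child thread
    b, calls_b = fib_with_task_count(n - 2)   # would be spawned as a child thread
    return a + b, 1 + calls_a + calls_b


if __name__ == "__main__":
    for n in (15, 20):
        print(n, fib_with_task_count(n))
```

The call count grows in proportion to the Fibonacci numbers themselves, so a modest increase in n produces a much larger task tree.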


Other applications in the test suite have more specific purposes. 


Matrix Multiply All of the applications described above have a very limited
communication structure. In order to test machine behavior for applications that contain more
communication, a blocked matrix multiply application was included in the test suite.
Four variations on the basic MATMUL were tried: coarse-grained and cached,
fine-grained and cached, coarse-grained and uncached, and fine-grained and uncached.

The cached versions simulated full-mapped caches [4]. The uncached versions were
tested in order to separate out the effect of caching on the application from the thread
managers’ effects.

Two partitioning strategies were employed for this application, as pictured in
Figures 6-8 and 6-9. The coarse-grained partitioning strategy takes advantage of locality
inherent to the application. The fine-grained strategy potentially loses some of this
locality, but gives the thread managers more flexibility by creating more than one
thread per processor.

Figure 6-9: Fine-Grained MATMUL Partitioning: A 16-by-16 matrix times a second 16-by-16
matrix gives a third 16-by-16 matrix. Each matrix is partitioned up into 16 4-element by 4-element
blocks. One thread multiplies sub-block 0,0 of the first matrix with sub-block 0,0 of the second
matrix to get a partial result for sub-block 0,0 of the destination matrix; another thread multiplies
sub-block 0,1 of the first matrix with sub-block 1,0 of the second matrix to get another partial
result for sub-block 0,0 of the destination matrix; and so on. The partial results for each destination
sub-block are added together to achieve a final value; consistency is ensured using I-structures to
represent the destination sub-blocks.


In all cases, the data was distributed around the machine in the obvious way, with
the upper left-hand corner of each matrix residing on the upper left-hand processor,
and so on.
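
The two partitioning strategies can be compared with a short serial sketch. Each inner list stands for the work of one thread; the function names are hypothetical, and the synchronization that the fine-grained version needs when several threads add into the same destination sub-block is omitted because the sketch runs the tasks one after another.

```python
def coarse_tasks(blocks_per_side):
    """One thread per destination sub-block (i, j): a whole block row times a whole block column."""
    n = blocks_per_side
    return [[(i, k, j) for k in range(n)] for i in range(n) for j in range(n)]


def fine_tasks(blocks_per_side):
    """One thread per sub-block product A[i][k] * B[k][j], summed into C[i][j]."""
    n = blocks_per_side
    return [[(i, k, j)] for i in range(n) for j in range(n) for k in range(n)]


def run_tasks(tasks, a, b, block_size):
    """Execute the tasks serially and return the product matrix."""
    size = len(a)
    c = [[0] * size for _ in range(size)]
    for task in tasks:                          # each task stands for one thread
        for i, k, j in task:
            for r in range(i * block_size, (i + 1) * block_size):
                for col in range(j * block_size, (j + 1) * block_size):
                    c[r][col] += sum(a[r][t] * b[t][col]
                                     for t in range(k * block_size, (k + 1) * block_size))
    return c


if __name__ == "__main__":
    print(len(coarse_tasks(4)), len(fine_tasks(4)))   # threads per strategy
```

For the 16-by-16 case with 4-by-4 blocks, the coarse strategy yields one thread per destination sub-block, and the fine strategy yields four times as many, each touching only two source sub-blocks.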


UNBAL UNBAL is a synthetic application whose purpose is to test the “impulse response”
of a thread manager. In this application, a number of fixed-length, non-communicating
threads are made to appear on one processor in the system. The program terminates
when the last thread completes. This test gives some insight into a thread manager’s
behavior with a load that is initially severely unbalanced. In all test cases, each
thread ran for 500 cycles.

Table 6.2: Running Characteristics for Applications in the Test Suite: $T_1$ is the running time in
cycles on one processor. $T_{crit}$ is the running time in cycles of the critical path. This is how long the
program would take to run on an infinitely large multiprocessor with no communication or thread
management overheads. The Average Grain Size is the average running time of the threads in the
application, in cycles.


6.5.1 Application Parameters

Table 6.2 gives the running characteristics of the various applications in the test suite for a
range of application sizes.

In the next chapter, we describe the results of simulating these applications. We try to
use those results to gain some insight into the behavior of the various thread management
algorithms.

Chapter 7

Results


In this chapter, we present experimental results obtained using the PISCES simulator. In 
doing so, we attempt to demonstrate a number of points about the thread management 
problem in general, and about tree-based algorithms in particular. 

For each application and problem size, we first identify a “region of interest” of machine 
sizes. If a machine is in this range, it is large enough with respect to the application so 
that thread management is not trivially easy, but small enough to make it possible for an 
incremental increase in machine size to yield a significant decrease in running time. For 
most of the rest of this chapter, we will only look at results that fall into that region. 

We then show that the tree-based algorithms we have developed are competitive with 
a number of unrealizable “ideal” thread managers. The different idealized managers ignore 
different costs inherent to the thread-management task in order to identify the effects those 
costs have on overall performance. The most radical idealization, Free-Ideal, pays no 
communication or contention costs at all by scheduling threads on a single contention-free 
queue with zero thread enqueue and dequeue costs. In most cases, the tree-based algorithms
achieve performance within a factor of three of Free-Ideal. The other idealized
managers are usually within a factor of two of Free-Ideal.

A comparison of realizable algorithms then shows that the tree-based algorithms we have
developed are competitive with simpler algorithms on machines with 256 or fewer processors,
and that for larger machines, the tree-based algorithms yield significant performance gains
over the simpler algorithms. In particular, because of their simplicity, the Round-Robin
algorithms perform best of all real algorithms on small machines, but on larger machines,
their performance suffers. The diffusion algorithms, on the other hand, seem to perform
marginally worse than the others on machines with 256 or fewer processors; as machine size
is increased, Diff-1 and Diff-2 perform very poorly with respect to the others.

When a processor from which threads can be stolen is located, consumer-based thread 
managers have a choice of policies when determining how much work to steal. The two 
choices we examine are steal-one and steal-half. These options differentiate C-Ideal-1 
and C-Ideal-2 thread managers from each other, as well as RR-1 and RR-2. We were 
interested in the effects of this policy choice while XTM was under development because 
XTM implicitly follows the steal-half policy: it always attempts to evenly balance the 
workload between two branches of a tree when moving work from one branch to the other. 
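
The two stealing policies can be sketched as follows. The rounding used for steal-half (the thief takes the larger half when the victim's queue length is odd) is an assumption, as are the function and policy names; threads are taken from the tail of the victim's queue, following the queue discipline of Section 6.4.3.

```python
from collections import deque


def steal(victim, thief, policy):
    """Move threads from the tail of `victim` to the tail of `thief`.

    `policy` is "steal-one" or "steal-half". Returns the number of threads moved.
    """
    if policy == "steal-one":
        count = 1 if victim else 0
    elif policy == "steal-half":
        count = (len(victim) + 1) // 2       # assumed rounding: larger half
    else:
        raise ValueError("unknown policy: " + policy)
    moved = [victim.pop() for _ in range(count)]
    thief.extend(moved)
    return count


if __name__ == "__main__":
    for policy in ("steal-one", "steal-half"):
        victim, thief = deque(range(8)), deque()
        steal(victim, thief, policy)
        print(policy, len(victim), len(thief))
```

Steal-one leaves the victim with most of its work, so further idle processors must query it again; steal-half splits the work in one operation, at the cost of moving more threads per steal.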

We then compare the three candidate tree-based algorithms with each other. For the
Alewife parameter set, we find that TTM (no nearest-neighbor links) performs better
than XTM. XTM-C always performs poorly, despite the theoretical prediction of optimal
behavior, for two reasons:

1. The work estimates maintained at the tree nodes can be inaccurate, due to time delays
inherent to the update process, inaccuracies built into the system to lower update
costs, and, most importantly, the incorrect assumption that all threads represent the
same amount of work.

2. Maintaining work estimates in the tree carries significantly higher overhead than
maintaining one-bit presence information. This added overhead results in correspondingly
lower performance.


It is interesting to see what happens when the processor speed is increased with respect
to the speed of the communications network: effects that previously showed up on large
machines running large problems begin to appear on smaller machines running smaller
problems. Furthermore, on machines with faster processors, the locality gains inherent in
XTM become more important, and XTM’s performance surpasses that of TTM. This will be
especially relevant if current trends in technology continue, in which processor performance
is going up faster than network performance. In addition, cutting the processor cycle time
gives us an inexpensive way of investigating “large machine” behavior without paying for
it in simulation cycles.

Finally, we look at MATMUL, an application that demonstrates strong static data-task
locality. When caches are simulated, performance depends heavily on good partitioning.
The finely partitioned version fails to keep data locality within a single thread; in this case,
none of the dynamic thread managers can recapture that locality. Conversely, the coarsely
partitioned case keeps more accesses to the same data within each thread. For the coarsely
partitioned version, the tree-based thread managers perform very nearly as well as any of
the idealized dynamic managers, and almost as well as a good statically mapped version.

Figure 7-1: Experimental Parameters.

7.1 Parameters


When examining the performance of thread management algorithms, there are many
possible parameters that can be varied, as illustrated in Figure 7-1. We separate simulation
parameters into three major groups: Machine, Thread Management Algorithm and
Application. In the following subsections, we discuss each major group in turn. We also classify
the parameters according to whether they will remain fixed or be varied.

7.1.1 Machine Parameters

This section discusses the various architectural parameters that go into a PISCES
simulation.
Machine Architecture: Processor, Memory System, Interconnection Network.

Machine Size p: The number of processors in the machine.

Network Speed $t_n$: The time, in cycles, that it takes one flit to travel from one switch to
the next.

Low-Level Software Costs: A number of low-level activities such as thread creation,
thread suspension, etc. carry costs that are independent of the thread management
algorithm.


All of the above parameters were taken directly from the Alewife machine. All of these
parameters except p and $t_n$ are fixed throughout the simulations. Low-level software is
assumed to be that of the prototype Alewife runtime system; the corresponding overheads
were measured from NWO runs, and are given in Table 6.1.

The two machine parameters that are varied are p and $t_n$. We are interested in
running programs on machines of various sizes, so we vary p in order to study the behavior of
different thread management algorithms as machines become large. We use $t_n$ in the same
manner: one way of simulating a “large” machine is to increase interprocessor
communication times. Increased $t_n$ gives the impression of increased machine size without taking
correspondingly greater time to simulate. Furthermore, as the state of the prevailing
technology advances, the trend is moving towards faster and faster processors. Communication
latencies are already near the speed of light and cannot be reduced very far from current
levels. Therefore, as processors continue to improve, we expect $t_n$ to show a corresponding
increase.

7.1.2 Thread Management Parameters


In Chapter 6, we described a number of candidate thread management algorithms to be
compared against one another. These were split into two groups: realizable and unrealizable.
The applications listed above were each run using each of the thread managers, repeated
here for completeness:


Unrealizable

Ideal: Optimal case, calculated based on single-processor running time and critical
path calculations.

Free-Ideal: Simulates a single zero-cost, zero-contention thread queue.

P-Ideal: Simulates free instantaneous knowledge of the state of all processors’ thread
queues. Threads are moved around by the processors that create them.

C-Ideal-1: Simulates free instantaneous knowledge of the state of all processors’
thread queues. Threads are moved around by idle processors, using the
“Steal-One” policy.

C-Ideal-2: Same as C-Ideal-1 except that the “Steal-Half” policy is used.

Realizable

RR-1: A simple round-robin thread manager in which each processor scans every
other processor for work, using the “Steal-One” policy.

RR-2: Same as RR-1 except that the “Steal-Half” policy is used.

Diff-1: Diffusion-based thread manager with no instabilities.

Diff-2: Diffusion-based thread manager with a small instability; performs better than
Diff-1.

XTM: X-Tree algorithm as described in Chapter 4.

TTM: Same as XTM except that no nearest-neighbor links are used: threads can
only migrate between subtrees that share a parent.

XTM-C: Similar to XTM except that more accurate work estimates are maintained
at the nodes. XTM maintains a single-bit work estimate (present or absent);
this variant uses a multi-bit work estimate to determine whether the cost of a
balance operation between two nodes justifies the expected gains.

Stat: For certain applications, a near-optimal static schedule can easily be produced.
The associated thread management algorithm does nothing but run the threads
that are sent to each processor in a statically determined fashion.


7.1.3 Application Parameters 


Data were taken for a number of applications and problem sizes, as described in
Chapter 6. We list the chosen applications and corresponding problem sizes below. For a list of
application-specific characteristics, see Table 6.2.
Application 


AQ: Adaptive quadrature integration. 

TSP: Traveling salesman problem. 

FIB: Doubly-recursive Fibonacci. 

UNBAL: Synthetic “unbalanced” application. 


MATMUL: Matrix Multiply. 


Problem Size 


AQ: Sensitivity for convergence: ranges from 0.5 to 0.001.

TSP: Number of cities in tour: ranges from 8 to 11.

FIB: Calculates the nth Fibonacci number: ranges from 15 to 25.

UNBAL: Number of threads: ranges from 1024 to 65536.

MATMUL: Matrix size n ([n x n] x [n x n] = [n x n]): ranges from 16 to 64.

[Table 7.1 body not recoverable from the extracted text. Surviving cells: p range 1-16384;
managers “all but Stat”; rows for MATMUL (coarse, uncached), MATMUL (fine, cached) and
MATMUL (fine, uncached).]

Table 7.1: Experiments Performed ($t_n = 1$): In all cases, RR-1 and RR-2 were tested for
p ≤ 4096, Diff-1 and Diff-2 were tested for p ≤ 1024 and Coarse Uncached MATMUL with Diff-1
and Diff-2 was tested for p ≤ 256. For UNBAL(16), P-Ideal, C-Ideal-1 and RR-1 were tested for
p ≤ 1024; for UNBAL(32) and for UNBAL(64), they were tested for p ≤ 256. For UNBAL(1024),
Stat was tested for p ≤ 1024; for UNBAL(4096), Stat was tested for p ≤ 4096.


7.2 Experiments 


For each application, data were taken over a range of machine sizes, problem sizes, ratios of 
processor speed to network speed, and thread management algorithms. Tables 7.1 and 7.2 
describe the resulting (nearly complete) cross product. For a listing of raw experimental 
results, see Appendix C. 

Note that in all cases, RR-1 and RR-2 were only tested for machine sizes up to 4096
processors, and Diff-1 and Diff-2 were only tested for machine sizes up to 1024 processors.
These thread managers performed poorly on large machines; consequently, simulations of
machines larger than these maximum sizes took too long to be practical.

[Table 7.2 columns: Application, Problem Size, $t_n$, p Range, Managers. The table body is
not recoverable from the extracted text. Surviving cells: p ranges 1-16384, 1-4096 and 1-1024;
manager sets “all but Stat”, “all but XTM-C and Stat” and “all but Ideal and XTM-C”; MATMUL
rows for the (coarse, cached), (coarse, uncached), (fine, cached) and (fine, uncached) variants.]

Table 7.2: Experiments Performed (Variable $t_n$): For each application, a suitable problem size
was selected; for that problem size, $t_n$ was then varied. In all cases, RR-1 and RR-2 were tested
for p ≤ 4096, Diff-1 and Diff-2 were tested for p ≤ 1024 and Coarse Uncached MATMUL with
Diff-1 and Diff-2 was tested for p ≤ 256. Furthermore, for UNBAL(64), P-Ideal, C-Ideal-1 and
RR-1 were tested for p ≤ 256.



7.3 Machine Size and Problem Size - “Regions of Interest”


For each application and problem size, for the purpose of evaluating thread management 
algorithms, there exists a “region of interest” of machine sizes. Figures 7-2 through 7-5 
display performance curves for a number of applications. All graphs in this chapter are log- 
log plots of machine performance (cycles) against p (number of processors) for a number 
of different thread managers. Virtually all of these curves have the same general shape: 
for small machine sizes, they are nearly linear (running time is inversely proportional to 
p); for large numbers of processors, they are nearly flat (running time stays constant with 
increasing p). Somewhat inaccurately, we term this flat region the saturation region for a 
given performance curve. 

For a given application and problem size, a machine size is “interesting” if near-linear 
speedup is possible, but not trivial, to obtain, on a machine of that size. More specifically, 
for a given application and problem size, a machine size is in the region of interest if it is 
large enough so that thread management is not trivially easy, but small enough to make 
it possible for an incremental increase in machine size to yield a significant decrease in 
running time for some thread management algorithm. The region of interest is that range 
of machine sizes for which all performance curves of interest undergo the transition from 
linear to saturated. Good thread managers achieve nearly linear performance throughout
most of the region, only reaching saturation towards the right-hand side of the region (larger
p). Bad thread managers, on the other hand, saturate near the left-hand side of the region
(smaller p). For most of the rest of this chapter, we will only look at results for machine
sizes that are in this region.
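
The general shape of these curves (inversely proportional to p at first, then flat) can be illustrated with the simplest bound built from the two quantities defined for Table 6.2. The sketch assumes that an idealized running time is the larger of $T_1 / p$ and $T_{crit}$; whether the Ideal manager was computed in exactly this way is not stated here, and the numbers used are hypothetical.

```python
def ideal_running_time(t1, t_crit, p):
    """Assumed lower bound on running time: max(T_1 / p, T_crit)."""
    return max(t1 / p, t_crit)


def saturation_point(t1, t_crit):
    """Machine size at which the linear region meets the flat region."""
    return t1 / t_crit


if __name__ == "__main__":
    t1, t_crit = 1_000_000, 1_000          # hypothetical cycle counts
    for p in (1, 4, 16, 64, 256, 1024, 4096, 16384):
        print(p, ideal_running_time(t1, t_crit, p))
    print(saturation_point(t1, t_crit))
```

On a log-log plot this bound is a line of slope -1 up to the saturation point and horizontal beyond it. Real thread managers leave the line earlier, and how much earlier is what the region of interest is meant to expose.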

Figures 7-2 through 7-5 give the regions of interest for AQ, FIB, TSP and UNBAL.
In addition to displaying five performance curves, each graph contains two vertical dotted
lines, which mark the range of p that makes up the region of interest. Notice how the
region of interest moves to the right (larger machines) as problem size increases. We give
the entire progression for AQ (see Figure 7-2), and the endpoints of the progressions for
FIB (see Figure 7-3), TSP (Figure 7-4) and UNBAL (Figure 7-5). The region of interest
data are summarized in Table 7.3.

Figure 7-2: AQ: Regions of Interest.

Figure 7-3: FIB: Regions of Interest.

Figure 7-4: TSP: Regions of Interest.

Figure 7-5: UNBAL: Regions of Interest.

Application   Problem Size   $T_1$          $n_t$      $p_{int}$
AQ            0.05           4,169,108      2,943      16-1024
AQ            0.01           20,201,061     14,269     64-4096
AQ            0.005          43,062,592     30,417     64-16384
AQ            0.001          213,632,152    150,049    256-16384
FIB           20             7,244,462      13,529     16-4096
FIB           25             80,350,904     150,889    64-16384
UNBAL         16384          9,617,771      16,384     64-16384
UNBAL         65536          38,469,995     65,536     64-16384

[The remaining rows, including the smallest AQ, FIB and UNBAL problem sizes and all TSP rows,
are not legible in the extracted text.]

Table 7.3: Regions of Interest: $T_1$ is the running time in cycles on one processor. $n_t$ is the number
of threads needed for a given run. $p_{int}$ is the range of p that makes up the region of interest. Note
that the upper limit of a given region of interest is limited by the maximum number of processors
on which that particular experiment was run.


The data presented in these figures leads us to the following conclusions:

1. In all cases, to the left of the region of interest, there is very little difference between
the various thread management algorithms shown. In particular, XTM performs as
well as the others in this region. This is not surprising, since to the left of the region of
interest, any thread management algorithm should do pretty well as long as it doesn’t
add much overhead.

2. In the region of interest, the quality of the various thread managers becomes apparent.
In this region, for large applications, “good” thread managers (e.g., XTM) achieve
near-linear speedup over most of the range, while “bad” thread managers (e.g., Diff-1
and Diff-2) perform poorly, with run times that remain constant or increase as p
increases.

3. To the right of the region of interest, all thread managers reach saturation. In some
cases, adding processors actually decreases performance; in all cases, adding processors
does not yield significant performance gains.

Figure 7-6: AQ: XTM vs. P-Ideal, C-Ideal-1 and C-Ideal-2.

Figure 7-7: FIB: XTM vs. P-Ideal, C-Ideal-1 and C-Ideal-2.

7.4 Comparing Tree-Based Algorithms to Unrealizable Algorithms


The tree-based algorithms we have developed exhibit performance that is competitive with
a number of unrealizable “ideal” thread managers. The most idealized of these is
Free-Ideal. This manager pays none of the communication or contention costs inherent to thread
management, using a single contention-free queue with zero thread enqueue and dequeue
costs. This gives a lower bound on the achievable running time when there are no
inter-thread dependencies. For all but one of the applications tested here, such dependencies do
exist, but a simple heuristic seems to get near optimal performance in all cases: run threads
on a single processor in a depth-first fashion and put all reenabled threads on the head of
the running queue.

Figure 7-8: TSP: XTM vs. P-Ideal, C-Ideal-1 and C-Ideal-2.

Figure 7-9: UNBAL: XTM vs. P-Ideal, C-Ideal-1 and C-Ideal-2.

The other idealized thread managers pay communication costs for moving threads and 
values around, but have “free” information on the state of the system at all times. The 
producer-oriented version (P-Ideal) uses this information to “push” newly created threads 
from the creating processor to destinations chosen by distance and queue length. The 
consumer-oriented versions (C-Ideal-1 and C-Ideal-2) “pull” threads from the nearest 
non-empty queue. When P-Ideal sends a thread to a queue, it makes a note of the fact 
so that other processors realize that the queue is about to get longer and act accordingly. 
When C-Ideal-1 or C-Ideal-2 decides to take one or more threads off a queue, it makes a 
similar note so that other processors know that those threads are “spoken for.” 

The relation of P-Ideal to C-Ideal-1 and C-Ideal-2 is that of eager to lazy.
When the delay between when a thread is created and when it is sent to its final destination
is most relevant to achieving good performance, P-Ideal does well. However, in general,
the later a decision is made, the more information is available and the better the choice.
Therefore if the inaccuracy of choice inherent to P-Ideal is a major factor, C-Ideal-1 and
C-Ideal-2 will perform better. For the applications tested, dispatch time was slightly more
important than accuracy of choice, so P-Ideal usually beat out C-Ideal-1 and C-Ideal-2
by a slight margin. Figure 7-9 is the exception, again by a small margin.

In all cases, Free-Ideal achieved the best performance, followed by P-Ideal and then
C-Ideal-1 and C-Ideal-2. On large numbers of processors, the idealized thread managers
outperformed XTM by a factor of 1.5 to four. It seems that the primary advantage the
idealized managers have over XTM stems from the availability of free, completely up-to- 
date global information; the real thread managers only have the information that is made 
available to them. This information is in general both late and inaccurate. 


From the data presented in Figures 7-6 through 7-9, we derive the following conclusions:

1. There doesn’t seem to be much difference between eager and lazy decision-making for
the applications we tested.

2. The costs inherent to the collection and dissemination of global information can be
prohibitively high.

Finally, note that in Figure 7-9, the P-Ideal and C-Ideal-1 results saturate very early
in the region of interest. This is due to a serialization inherent to those algorithms (see
Section 7.6).

Figure 7-11: FIB: XTM vs. Diff-1, Diff-2, RR-1 And RR-2.

7.5 Comparing Tree-Based Algorithms to Other Realizable Algorithms


A comparison of realizable algorithms speaks favorably for XTM. Figures 7-10 through 7-13
compare XTM, Diff-1, Diff-2, RR-1 and RR-2, using Ideal and Free-Ideal as baselines.
Note that for small problem sizes, whose regions of interest cover relatively small machine
sizes, there is little to choose between the various thread managers. However, for larger
problem sizes, a significant performance difference between the thread managers begins to
appear.

Figure 7-13: UNBAL: XTM vs. Diff-1, Diff-2, RR-1 And RR-2.

RR-1 and RR-2 run well on small machines. Their simple structure requires no
additional overhead on any thread creation or consumption activities; no distributed global
data structures need to be updated. As long as communication costs remain low,
Round-Robin is fine, but as machine size increases, RR-1 and RR-2 place demands on the
communication system that it cannot meet. In particular, suppose a processor running a
Round-Robin thread manager queries five other processors before finding work. As long
as that query time is small compared to the thread loading, running and termination times,
the Round-Robin manager will achieve good performance. As machine sizes increase, the
latency of each query goes up, and eventually the query time dominates the performance.


Diff-1 and Diff-2 perform slightly worse than the others on small machines, and very
poorly on large machines. There are several reasons for this. Most important, the need for
algorithmic stability in the relaxation leads to a “minimum slope” in the load on adjacent
processors: when the load on two neighboring processors differs by two or less, no exchange
takes place on a relaxation step. This means that for a p-processor square machine on a
2-D mesh, $2p\sqrt{p}$ threads are needed to fill the machine, not p as one would hope for on a
p-processor machine. This is the main reason that Diff-1 performs poorly.
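
One way to see where a figure of this order comes from is to count the threads in a load profile that Diff-1 would leave untouched. The sketch below assumes that all work starts at one corner of a square mesh, that the stable profile drops by exactly the minimum slope of two per hop, and that the farthest processor holds a single thread; these are illustrative assumptions rather than the derivation used in the thesis.

```python
import math


def threads_to_fill(p, min_slope=2):
    """Threads in a stable profile that gives every processor at least one thread.

    Assumes a sqrt(p)-by-sqrt(p) mesh with the source at corner (0, 0) and a
    load that falls by `min_slope` per hop of Manhattan distance.
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    farthest = 2 * (side - 1)               # hops from the source corner to the far corner
    return sum(1 + min_slope * (farthest - (i + j))
               for i in range(side) for j in range(side))


if __name__ == "__main__":
    for p in (16, 256, 1024):
        side = math.isqrt(p)
        print(p, threads_to_fill(p), 2 * p * side)
```

Under these assumptions the total works out to $2p\sqrt{p} - p$, which has the same leading term as the figure quoted above; with a minimum slope of zero the same count reduces to p.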


This problem is fixed in Diff-2, at the price of a small instability: the constants are
set in such a way that a single thread can bounce back and forth between two processors
on alternate diffusion steps. This same set of constants yields a “minimum slope” of 0,
eliminating the requirement for $2p\sqrt{p}$ threads to fill the machine.


Unfortunately, even when there is no such interaction between integer queue lengths
and stability requirements, the relaxation time is still $\Theta(l^2)$, where $l$ is the diameter of
the communications network [3]. This is very slow compared to the tree algorithms, which
balance neighboring tree nodes in time proportional to the distance between the nodes, not
the square of that distance.


Finally, the overhead of unneeded relaxation steps slows down all processors by some fixed
amount. This amount varies from one or two percent on four processors to more than 75
percent on a large machine with high $t_n$. This variation results from the fact that the
cost of a diffusion step depends on the communication time between a processor and its
nearest neighbors. For these reasons, even a perfectly balanced application achieves inferior
performance under Diff-1 and Diff-2.


Figures 7-10 through 7-13 lead us to the following conclusions: 


1. The Round-Robin algorithms perform best of all real algorithms on machines of 256
or fewer processors, but their performance saturates early on larger machines.

2. XTM is competitive with simpler algorithms on small machines. On larger machines,
XTM continues to achieve speedup, outperforming the Round-Robin thread
managers by a large margin in some cases.

3. The Diffusion algorithms perform marginally worse than the others on small machines.
As machine size is increased, Diff-1 performs extremely poorly with respect to the
others. Diff-2 achieves somewhat better results than Diff-1, but is still significantly
inferior to the others.


As in the previous section, note that in Figure 7-13, the P-Ideal and C-Ideal-1
results saturate very early in the region of interest. This is again due to a serialization
inherent to those algorithms (see next section).
7.6 Steal-One vs. Steal-Half


When a processor from which threads can be stolen is located, consumer-based thread 
managers have a choice of policies when determining how much work to steal. The two 
choices we examine are steal-one and steal-half; they differentiate C-Ideal-1 from C-Ideal-2,
as well as RR-1 from RR-2. For non-pathological applications, either choice can win
out. However, when the initial load is severely unbalanced, as is the case with UNBAL, the
performance of steal-half far exceeds that of steal-one (see Figure 7-14), due to serialization
on the producer side.


For the UNBAL results shown, the poor performance curves for C-Ideal-1 and RR-1 
are a direct result of the fact that only one thread at a time is taken from a processor whose
queue contains runnable threads. Since UNBAL initially places all runnable threads on one 
processor, that processor becomes a bottleneck in the thread distribution process. It takes 
time for each thread-moving operation to run; if each thread is handled separately, then 
the rate at which threads can migrate from that processor to the rest of the machine is 
limited to the inverse of the amount of time it takes to move one thread off that processor. 
This “serialized” behavior puts a hard upper limit on the number of processors that can be 
effectively used. 
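The serialization argument can be illustrated with a toy model. The sketch below is not the simulator used for the results in this chapter; it assumes that time advances in rounds, that each processor holding at least two threads can hand work to at most one idle processor per round, and that no thread finishes while work is being spread. The name `rounds_until_all_busy` is hypothetical.

```python
def rounds_until_all_busy(num_procs, num_threads, steal_half):
    """Count transfer rounds until every processor holds at least one thread.

    All threads start on processor 0, as in UNBAL. In each round, every
    processor with two or more threads serves at most one idle processor.
    """
    loads = [0] * num_procs
    loads[0] = num_threads
    rounds = 0
    while any(load == 0 for load in loads):
        # Donors are chosen before any transfer, so a processor that receives
        # work in this round cannot also give work away in this round.
        donors = [i for i, load in enumerate(loads) if load >= 2]
        idle = [i for i, load in enumerate(loads) if load == 0]
        if not donors:
            raise ValueError("not enough threads to occupy every processor")
        for donor, thief in zip(donors, idle):
            amount = loads[donor] // 2 if steal_half else 1
            loads[donor] -= amount
            loads[thief] += amount
        rounds += 1
    return rounds


if __name__ == "__main__":
    for procs in (16, 64, 256):
        one = rounds_until_all_busy(procs, 4096, steal_half=False)
        half = rounds_until_all_busy(procs, 4096, steal_half=True)
        print(procs, one, half)
```

In this model steal-one needs one round per idle processor, because the original processor remains the only donor, while steal-half doubles the number of donors each round and so needs a number of rounds that grows only logarithmically with machine size.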

Most applications do not display the kind of pathological behavior shown in UNBAL. 
Not all threads represent the same amount of work, and it is rare that a large number
of small threads end up clustered in a very small region of a machine. However, UNBAL 
represents an extreme case of a machine whose load is severely unbalanced, and results of 
UNBAL runs give some insight into how a thread management scheme will behave in the 
presence of an unbalanced load. 

For the other applications we simulated, the choice between steal-one and steal-half
made very little difference. In the FIB graph in Figure 7-14, RR-1 achieved slightly better 
results than RR-2; in the TSP graph, RR-2 slightly outperformed RR-1. In both cases, 
there was no discernible difference in performance between C-Ideal-1 and C-Ideal-2. 

The tree-based algorithms are based on a policy that attempts to balance neighboring 
tree nodes evenly whenever one of the neighbors becomes empty. This section justifies this 
choice independently of the use of tree-type data structures. In particular, this section gives 
a case in which an uneven load balance will cause thread managers that employ a “steal-one”
policy to exhibit serialized behavior. In other cases, the choice between steal-half and
steal-one doesn’t seem to be particularly important.
Figure 7-16: FIB: Comparing Various Tree-Based Algorithms.


7.7 Differences Between Tree-Based Algorithms 


Three variants on a tree-based thread management theme were explored: TTM, XTM
and XTM-C. TTM gives the best results for machines as large as the ones we measured,
with $t_n = 1$. XTM-C performs very poorly on all but very slow networks for the following
reasons:


1. The work estimates maintained at the tree nodes can be inaccurate, due to time
delays inherent to the update process, inaccuracies built into the system to lower
update costs, and, most importantly, the incorrect assumption that all threads are
leaves in the application’s task tree.

2. Maintaining work estimates in the tree carries significantly higher overhead than maintaining
one-bit presence information. This added overhead results in correspondingly
lower performance.

The data presented in Figures 7-15 through 7-18 suggest the following conclusions:


1. For the match between processor speed and network speed in Alewife ($t_n = 1$), TTM
without the X-Tree’s nearest-neighbor links is preferable, at least for machines containing
up to 16,384 processors, which were the largest we could simulate. For a
discussion of the effect of increasing $t_n$, see Section 7.8.

2. XTM-C never performs particularly well due to overhead costs and inaccurate weight
estimates.
7.8 Faster Processors


Most of the data presented in this thesis assumes $t_n = 1$. Current technological trends
suggest that in the future, the ratio between processor speed and network speed will become 
quite high. Since this research is primarily concerned with a world in which large-scale 
multiprocessors are commonplace, it is only natural that we should investigate a situation 
in which processors are very fast with respect to the interconnection network. 

In addition, even PISCES simulations are orders of magnitude slower than actually 
running real programs on real hardware. This is compounded by the fact that simulating 
a p-processor multiprocessor on a uniprocessor is further slowed by a factor of p. When 
investigating the behavior of large machines, it would be very nice if simulations of smaller 
machines could in some way “fake” large-machine behavior. Artificially scaling the communication
latency by increasing $t_n$ seems to have just that effect, at least for the purposes of
this thesis.

Figures 7-19 through 7-22 show the effect of increasing $t_n$ from one cycle to 64 cycles.
In each case, note that as the network slows down, the qualitative differences between the
thread managers become more apparent on smaller machines. In particular, the drawbacks
of Diff-1, Diff-2, RR-2 and XTM-C show up more clearly. More interesting is the relation
between TTM and XTM. As $t_n$ is increased, the gains due to XTM’s near-neighbor links
become more important, and XTM’s performance surpasses that of TTM.

Another item of interest appears in Figure 7-20. Even for the case where $t_n = 1$, XTM-C’s
performance takes a sharp dive on more than 1024 processors. For $t_n = 8$, the performance
degradation begins at 256 processors, and for $t_n = 64$, performance was so bad that
it was impractical to simulate. This poor performance results from the fact that XTM-C
will not balance the load between two neighboring nodes unless the cost, which is measured
as a function of the communication time between the two nodes, is outweighed by the advantage,
which is predicted as a function of the amount of work the manager thinks is on
the two nodes. For FIB in particular, these work estimates are poor, due to the system’s
lack of knowledge about how threads create other threads. It is therefore often the case
that XTM-C does not balance the workload on neighboring nodes when it should have
done so. Note, however, that when the manager knows about all the work in the system, as
in Figure 7-22, XTM-C’s performance surpasses that of XTM for $t_n = 64$, by the slimmest
of margins.

A final item of interest that occurs for large $t_n$ concerns Diff-1 and Diff-2. In all cases,
the running time for the two Diffusion schedulers is actually longer on four processors than
on one processor when $t_n = 64$. This is because on a machine with more than one processor,
the cost of a diffusion step depends on $t_n$. When $t_n$ is large, this overhead overwhelms the
performance gains when going from one processor (no external communication takes place) 
to four processors (external communication takes place on every diffusion cycle). 

The most important lesson to learn from the data presented in Figures 7-19 through 7-22 
is the confirmation of asymptotic analysis. In all our analyses, we assumed that interprocessor
communication is the dominating factor in large-scale system performance. When
we adjust the ratio between computation and communication speeds so that this is the case
for the machines we examined, the thread managers that yield good theoretical behaviors
also yield good simulated behavior.
7.9 MATMUL: The Effects of Static Locality


Finally, we look at MATMUL, a simple matrix multiply application. The two aspects of 
MATMUL performance we were most interested in were the effects of caching and the effects 
of different partitioning strategies. We therefore studied four cases: coarse-grained cached, 
fine-grained cached, coarse-grained uncached and fine-grained uncached. The data from 
these cases are presented in Figures 7-23 through 7-30. 

When we say that MATMUL demonstrates strong static locality, we mean the following. 
In both the coarse-grained case and the fine-grained case, each thread accesses certain data 
elements in each matrix. The blocked algorithm breaks each matrix into a number of sub- 
blocks, which are spread out over the processor mesh in the obvious way. The overall 
multiply algorithm is then decomposed into sets of smaller matrix multiplies, each of which
finds the product of two sub-blocks (see Figures 6-8 and 6-9). Each thread accesses certain 
data elements of each matrix, in many cases more than once. A thread that accesses a block 
of data that resides on a given processor will run faster if it runs on or near to the processor 
on which its data is resident. Since the data is allocated in a static fashion, we say that the 
application demonstrates static locality. 
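The decomposition into sub-block multiplies can be sketched as follows. This is an illustrative sequential version, not the MATMUL code measured in this chapter; the function name `blocked_matmul` and the task tuple layout are hypothetical. Each task touches one sub-block of each input matrix and one sub-block of the result, which is the source of the static locality described above.

```python
def blocked_matmul(a, b, block):
    """Multiply square matrices a and b (lists of lists) one sub-block at a time.

    Returns the product and the list of (row_block, col_block, inner_block)
    tasks, each of which corresponds to one small sub-block multiply.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    tasks = [
        (bi, bj, bk)
        for bi in range(0, n, block)
        for bj in range(0, n, block)
        for bk in range(0, n, block)
    ]
    for bi, bj, bk in tasks:
        # This task reads block (bi, bk) of a and block (bk, bj) of b, and
        # accumulates into block (bi, bj) of c.
        for i in range(bi, min(bi + block, n)):
            for j in range(bj, min(bj + block, n)):
                partial = 0
                for k in range(bk, min(bk + block, n)):
                    partial += a[i][k] * b[k][j]
                c[i][j] += partial
    return c, tasks


if __name__ == "__main__":
    product, tasks = blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1)
    print(product, len(tasks))
```

A thread that runs one of these tasks on or near the processor holding its sub-blocks avoids most remote accesses, whereas a thread placed elsewhere pays the remote cost for every element it reads.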


Halstead and Ward [27] define locality of reference as follows:


Reference to location $X$ at time $t$ implies that the probability of access to location
$X + \Delta X$ at time $t + \Delta t$ increases as $\Delta X$ and $\Delta t$ approach zero.


Caching is one mechanism that takes advantage of locality: when a remote data item is 
cached, multiple accesses to the item only pay the remote access cost once. Clearly, the 
behavior of an application that demonstrates locality will be strongly affected by the caching 
strategies of the machine on which the application is run. Since MATMUL is interesting 
primarily for its locality-related behavior, we decided to look at its running characteristics 
both in the presence and the absence of caches. 
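The effect of caching on repeated accesses can be made concrete with a small cost model. The sketch assumes a fixed cost per remote access and a fixed cost per cache hit, and it ignores invalidations and capacity misses; the function name and the cycle counts in the usage lines are hypothetical, not measurements from the simulator.

```python
def access_cost(num_accesses, remote_cost, hit_cost, cached):
    """Total cost of reading one remote data item num_accesses times."""
    if num_accesses == 0:
        return 0
    if cached:
        # The first access pays the remote cost; later accesses hit the cache.
        return remote_cost + (num_accesses - 1) * hit_cost
    # Without a cache, every access pays the full remote cost.
    return num_accesses * remote_cost


if __name__ == "__main__":
    print(access_cost(10, remote_cost=50, hit_cost=1, cached=True))
    print(access_cost(10, remote_cost=50, hit_cost=1, cached=False))
```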

Except for Stat, all the thread management strategies we tested are dynamic, and can 
make use of no specific information about individual threads. Consequently, the managers 
have no way to make use of locality characteristics tying specific threads to specific processors,
other than general heuristics that try to keep threads close to their point of origin.
For this reason, in the uncached case, all potential locality-related performance gains are
inaccessible to the thread managers.


Two partitioning strategies for MATMUL are described in Chapter 6. The coarsely 
partitioned approach exhibits strong locality-related ties between each thread and the parts 
of the matrices it reads and writes. The finely partitioned approach loses much of that 
locality, but creates more threads, giving thread managers more flexibility. As might be 
expected, when caches are simulated, the coarsely-partitioned version runs much faster 
than the finely-partitioned version. For thread managers that exhibit load-sharing problems 
(diffusion algorithms, for example), the extra parallelism in the finely-partitioned version
was necessary in order to avoid disastrous performance degradations.


Coarse Partitioning, with Caches 


We now examine the details of the coarsely-partitioned version of MATMUL, with caches
(see Figures 7-23 and 7-24). In this case, there is very little separation between Stat, the
idealized managers (Free-Ideal, C-Ideal-1 and C-Ideal-2) and the tree-based managers
(TTM and XTM). On large machines, the performance of the round-robin managers (RR-1
and RR-2) begins to suffer. The diffusion managers (Diff-1 and Diff-2) perform poorly
for all problem sizes, machine sizes and network speeds.


Fine Partitioning, with Caches 


For the finely-partitioned case with caches, the loss of locality due to the fine partitioning
hurts the performance of all managers with respect to Stat (see Figures 7-25 and 7-26).
The separations between managers observed for the coarsely-partitioned case nearly disappear,
although the tree-based managers still perform marginally better than the other
realizable managers. The managers’ performance curves begin to separate out for large $t_n$,
but Stat always performs about twice as well as its nearest rival, primarily because
better locality behavior lowers its communication costs.
Coarse Partitioning, without Caches


Without caches, the system can’t make use of the higher degree of locality in the coarsely-partitioned
version (see Figures 7-27 and 7-28). However, the advantages of that locality
are mostly lost to Stat, since local cache misses also carry a significant expense, so the
results are similar in nature to the cached case, although the gap between Stat and the
others is significantly larger than in the cached case.


Fine Partitioning, without Caches 


As in the finely-partitioned cached case, the finely-partitioned uncached case doesn’t show 
much separation between the various thread managing strategies (see Figures 7-29 and 7-30).
As expected, the diffusion algorithms perform poorly, and for large machines, the
round-robin managers do worse than the others.


MATMUL gives some insight into the behaviors of the various candidate thread managers
when locality is an issue. Since a near-optimal static schedule can be derived from the
regular structure of the application, Stat always outperforms the other managers by a
discernible margin. However, when a coarse partitioning strategy is used and when caches 
are available to recapture most of the locality inherent to the application, TTM and XTM 
perform nearly as well as Stat, as do Free-Ideal, C-Ideal-1 and C-Ideal-2. 
Chapter 8


Conclusions and Future Work 


In the process of studying the behavior of XTM and other thread-management algorithms, 
we have learned a number of lessons, by both analytical and empirical means. Most importantly,
we found that on small machines, there is no need to do anything clever about
thread management: almost anything that avoids hot-spot behavior and that doesn’t impose
a high overhead will perform well. The results in Figures 7-2 through 7-5 point this
out most clearly: for machines smaller than those in the “regions of interest,” there is very
little difference between XTM, one of the best realizable algorithms we tested, and Diff-1,
one of the worst. The same figures show that on small machines, the added complexity 
of the tree-based algorithms doesn’t cost very much; the tree-based thread managers work 
nearly as well as any of the others even where their added sophistication is not needed. 
We also found that on large machines, communication locality becomes very important. 
One way to achieve lower communication costs is to use a message-passing style of computation,
which is possible for well-understood statically structured algorithms. Chapter 5 gives
asymptotic cost arguments in favor of XTM to this effect. Second, the MATMUL results
given in Chapter 7 show that when locality can be exploited to lower communication costs, 
it can lead to better program behavior. The fact that Stat runs were always the fastest 
points this out; the coarsely-partitioned case with caching also gives evidence to this effect. 
Section 7.6 demonstrated that parallel algorithms for large machines must avoid
hot-spot behavior, or else risk losing the benefits of large-scale parallelism. Therefore,
the tree-based thread-management algorithms presented in this thesis are all fully distributed,
employing combining techniques for the collection and distribution of the threads
being managed and of the global information needed to achieve good thread management.
Other candidate thread-management algorithms that contained serialized behavior suffered 
severely when that serialized behavior became important (see Figure 7-14, for example). 

We also learned that thread management is a global optimization problem. A good 
thread manager must therefore achieve efficient collection and distribution of relevant global 
information. One of the reasons that TTM and XTM work as well as they do is that they 
collect and distribute enough global information in order to make well-informed thread 
management decisions, without paying an excessive price for that information. Diff-1, 
Diff-2, RR-1 and RR-2 do not make use of global information and their performance 
suffers accordingly. 

Although our analytical results point out the need for locality to minimize communication
costs on large machines, it seems that for the particular set of parameters used for
most of our simulations, the added locality gained by XTM’s near-neighbor links doesn’t
pay for the higher costs associated with passing presence information over those links, as
compared with TTM. However, as processor speeds increase with respect to network
speeds, locality becomes more important and XTM performance surpasses that of TTM,
as shown in Figures 7-19 through 7-22. More generally, as processor speed increases with
respect to communication speed, “theoretically sound” thread management methodologies
become necessary, even on smaller machines.


8.1 Future Work 


There are a number of issues relevant to this research that we have left unresolved in 
this thesis. The first involves optimality proofs for the tree-based thread managers. In 
this thesis, our approach was to verify the good behavior of the overall algorithm through 
simulation. From the outset, we assumed that although we could analyze pieces of thread 
management algorithms with suitable approximations, when the pieces were assembled into 
an entire system, only simulations could give us insight into their behaviors. We have reason 
to believe, however, that we can make stronger formal statements about the behavior of
our algorithms. It seems that the tree algorithms as a whole might be provably polylog-competitive
with the optimal: we intend to continue work in that area.


It would be useful to study how inaccuracy in information kept in the tree affects our 
analytical results. In general, the information kept in the tree is out-of-date: it takes time 
to migrate up from the leaves to higher-level nodes. In all our analyses, we assumed the 
information in the tree was accurate; it would be interesting to explore the effects of the 
potential inaccuracies. Along the same lines, we assumed, for analytical purposes, that all 
interprocessor messages were the same length. In fact, messages that carry more information
(e.g., a number of threads to be moved from one area of the machine to another) have different
lengths. It would be useful to see how varying message lengths affects our results.


In Chapter 5, we give bounds on the cost of the update algorithms. These bounds are in 
some sense worst-case and best-case. The worst-case cost assumes there are very few threads 
on queues in the machine; the best-case cost is an expected cost given a sufficiently high, 
balanced workload. It would be useful to explore the region between these two extremes. 
For example, what kind of behavior is achieved if only a section of the machine has a
sufficiently high workload or if we relax the balance criterion to some degree?


It would be nice to verify our simulation results on actual machines. Perhaps the 128-processor
CM-5 recently purchased by the lab would be useful for that purpose, or maybe a
128-processor Alewife machine that is planned to be built sometime next year. Although by
our predictions, in many cases, 128 processors is not large enough to be interesting, perhaps
we can observe some of the predicted trends beginning to occur. Furthermore, if we can
figure out how to artificially increase $t_n$, 128 processors may be big enough to encounter
large-machine behavior.


It would be interesting to pay more attention to the effects of locality inherent to the 
applications being managed. Although the algorithms that make up XTM have good 
locality, the applications being managed are kept local inasmuch as XTM tries to keep 
threads close to their point of origin; other than that, no attention is paid to locality in the 
application. It would be interesting to study how well this heuristic works on an application
that carries a higher degree of inter-thread communication.


Finally, this work assumed no communication path from the compiler to the runtime
system. In some cases, the compiler should be able to distill information about the running
characteristics of a given application that the runtime system can use. What form would
that information take, such that the compiler can extract it and the runtime system can put
it to work? Perhaps some work on compiler-generated annotations of individual threads or
profile-driven compilation would be appropriate for this purpose.
Appendix A


Presence Bit Update Algorithm 


The following pseudocode gives a formal statement of the presence bit update algorithm. 


Update_Presence_Bit(Node)
{
    if (   (Node_Presence_Bit(Node) == 0)
        && (   (   Node_Is_Leaf_Node(Node)
                && Thread_Queue_Not_Empty(Node))
            || (   Node_Not_Leaf_Node(Node)
                && Child_Presence_Bits_Not_All_Zero(Node))))
    {
        Node_Presence_Bit(Node) = 1;
        Update_Neighbor_Presence_Bit_Caches(Node, 1);
        if (Node_Has_Parent(Node))
        {
            Update_Parent_Presence_Bit_Cache(Node, 1);
            Update_Presence_Bit(Node_Parent(Node));
        }
    }
    else if (   (Node_Presence_Bit(Node) == 1)
             && (   (   Node_Is_Leaf_Node(Node)
                     && Thread_Queue_Is_Empty(Node))
                 || (   Node_Not_Leaf_Node(Node)
                     && Child_Presence_Bits_Are_All_Zero(Node))))
    {
        Node_Presence_Bit(Node) = 0;
        Update_Neighbor_Presence_Bit_Caches(Node, 0);
        if (Node_Has_Parent(Node))
        {
            Update_Parent_Presence_Bit_Cache(Node, 0);
            Update_Presence_Bit(Node_Parent(Node));
        }
    }
}
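The pseudocode above can be exercised with a small runnable model. The sketch below is an assumption-laden translation, not the Alewife implementation: the `Node` class, its field names, and the use of dictionaries for the cached presence bits are all hypothetical, and the two branches of the pseudocode are folded into a single comparison between the old bit and the newly computed one.

```python
class Node:
    """One X-Tree node: leaves hold a thread queue, interior nodes hold
    cached copies of their children's presence bits."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.neighbors = []
        self.queue = []            # runnable threads (used by leaves only)
        self.presence_bit = 0
        self.child_cache = {}      # child node -> last presence bit it reported
        self.neighbor_cache = {}   # neighbor node -> last presence bit it reported
        if parent is not None:
            parent.children.append(self)


def update_presence_bit(node):
    """Recompute a node's presence bit and propagate any change upward."""
    if node.children:
        has_work = any(node.child_cache.get(child, 0) for child in node.children)
    else:
        has_work = bool(node.queue)
    new_bit = 1 if has_work else 0
    if new_bit == node.presence_bit:
        return  # nothing changed, so no messages are sent
    node.presence_bit = new_bit
    for neighbor in node.neighbors:
        neighbor.neighbor_cache[node] = new_bit
    if node.parent is not None:
        node.parent.child_cache[node] = new_bit
        update_presence_bit(node.parent)


if __name__ == "__main__":
    root = Node()
    left, right = Node(root), Node(root)
    left.queue.append("thread-1")
    update_presence_bit(left)
    print(root.presence_bit)
```

Because the update stops as soon as a node's bit is unchanged, adding a thread to a subtree that already contains runnable threads sends no messages above the first node whose bit is already set.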
Appendix B


Application Code


B.1 AQ 


;; See pgms/aq/faq.c for documentation. 


(herald FAQ) 


(DEFINE (MAIN LEV) 
(set *task-cycles* 1740) 


(LET ((TOL (CASE LEV 
0. 


((8) 
((7) 
((6) 
((5) 
((4) 
((3) 
((2) 
((1) 


0 
0 
0 
0. 
0 
5 
1 


.OE-4) 
(ELSE 5.0E-5)))) 


;; Adapted from SemiC output.


(AQ 0.0 0.0 2.0 2.0 TOL (Q 0.0 0.0 2.0 2.0)))) 


(DEFINE (Q X0 Y0 X1 Y1)
  (LET ((DX (- X1 X0))
        (DY (- Y1 Y0))
        (XM (/ (+ X0 X1) 2.0))
        (YM (/ (+ Y0 Y1) 2.0)))
    (/ (* (+ (F X0 Y0) (F X0 Y1) (F X1 Y0) (F X1 Y1)
             (* 2.0 (F XM YM)))
          DX DY)
       6.0)))


(DEFINE (F X Y)
  (LET* ((R0 (* X Y))
         (R1 (* R0 R0)))
    (* R1 R1)))

(DEFINE (AQ X0 Y0 X1 Y1 TOLERANCE Q0)
  (LET* ((XM (/ (+ X0 X1) 2.0))
         (YM (/ (+ Y0 Y1) 2.0))
         (Q1 (Q X0 Y0 XM YM))
         (Q2 (Q XM Y0 X1 YM))
         (Q3 (Q X0 YM XM Y1))
         (Q4 (Q XM YM X1 Y1))
         (SUM (+ Q1 Q2 Q3 Q4)))
    (IF (< (IF (> SUM Q0) (- SUM Q0) (- Q0 SUM)) TOLERANCE)
        (BLOCK (PAUSE 1000) SUM)
        (LET* ((TOLERANCE (/ TOLERANCE 4.0))
               (SUM0 (BLOCK (PAUSE 1220)
                            (FUTURE (AQ X0 Y0 XM YM TOLERANCE Q1))))
               (SUM1 (BLOCK (PAUSE 80)
                            (FUTURE (AQ XM Y0 X1 YM TOLERANCE Q2))))
               (SUM2 (BLOCK (PAUSE 80)
                            (FUTURE (AQ X0 YM XM Y1 TOLERANCE Q3))))
               (SUM3 (BLOCK (PAUSE 80)
                            (FUTURE (AQ XM YM X1 Y1 TOLERANCE Q4))))
               (VAL (BLOCK (PAUSE 100)
                           (+ (TOUCH SUM0) (TOUCH SUM1)
                              (TOUCH SUM2) (TOUCH SUM3)))))
          (PAUSE 100)
          VAL))))




B.2 FIB 


(herald ffib) 


(define (ffib n) 
(if (fx<= n 2) 
(block 
(pause 60) 
1) 
(let* ((lhs (block 
(pause 62) 
(future (ffib (fx- n 1))))) 
(rhs (block 
(pause 41) 
(future (ffib (fx- n 2))))) 
(val (block 
(pause 229) 
(fx+ (touch lhs) (touch rhs))))) 
(pause 66) 
val))) 


(define (main x) 


(set *task-cycles* 398) 
(ffib x)) 
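The shape of the task tree that `ffib` generates can be tabulated with a short sequential sketch. It is not part of the simulated application: the name `ffib_profile` is hypothetical, and the cycle accounting simply adds up the `pause` arguments that appear in the listing above (60 on the leaf path; 62 + 41 + 229 + 66 on the non-leaf path). That the non-leaf sum equals the 398 assigned to `*task-cycles*` is an observation about the listing; treating the two as the same quantity is an assumption.

```python
LEAF_CYCLES = 60
INTERNAL_CYCLES = 62 + 41 + 229 + 66  # sum of the pauses on the non-leaf path


def ffib_profile(n):
    """Return (value, calls, cycles) for the ffib task tree rooted at n.

    `calls` counts invocations of ffib; every invocation except the root is
    wrapped in a future in the original code. `cycles` sums the simulated
    pause cycles over the whole tree.
    """
    if n <= 2:
        return 1, 1, LEAF_CYCLES
    left_value, left_calls, left_cycles = ffib_profile(n - 1)
    right_value, right_calls, right_cycles = ffib_profile(n - 2)
    return (
        left_value + right_value,
        1 + left_calls + right_calls,
        INTERNAL_CYCLES + left_cycles + right_cycles,
    )


if __name__ == "__main__":
    for n in (4, 10, 15):
        print(n, ffib_profile(n))
```

Only the leaf invocations create no further threads, which is why a work estimate that treats every thread as a leaf of the task tree underestimates the work behind most FIB threads.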
B.3 TSP


;;; Branch-and-bound TSP solver.


;;; Objective: find shortest tour of all N cities starting at city 0.


(herald tsp) 


(define (main n ordered?) 
(set *task-cycles* (* 300 n)) 
(if (fx< n 1) 
(error "TSP solver only works with 1 or more cities!")) 
(let ((initial-path (make-path n)) 
(first-guess (make-path n))) 
(dotimes (i n) (set (path-element first-guess i) i)) 
(set (path-element initial-path 0) 0) 
(let ((ic-d-mat (cond ((null? ordered?)
(vref unordered-ic-d-mats n))
((eq? ordered? '#t)
(vref ordered-ic-d-mats n))
(else (vref opt-ordered-ic-d-mats n))))
(cities (cond ((null? ordered?) (nth unordered-cities n))
((eq? ordered? '#t) (nth ordered-cities n))
(else (nth opt-ordered-cities n)))))
(init-best-so-far first-guess ic-d-mat cities) 
(let* ((s-path 
(find-shortest-path initial-path 1 n ic-d-mat cities)) 
(s-path (if (null? s-path) first-guess s-path))) 
(message (format nil "Shortest Path: ~s" 
(output-path s-path cities))) 
(message (format nil "Shortest Len : ~d" 
(path-length s-path ic-d-mat))) 
s-path)))) 
;;;
;;; Main recursive path finder: returns #f if no path shorter than
;;; current best along current path.
;;;


(define (find-shortest-path path-so-far step-index n-cities ic-d-mat 
cities) 
(cond ((fx>= (path-length path-so-far ic-d-mat) (best-so-far)) 

(pause 50) 

#f) 

((fx>= step-index n-cities)
(message (format nil "New Best Path: ~s"
(output-path path-so-far cities)))
(message (format nil "New Best Len : ~s"
(path-length path-so-far ic-d-mat)))

(set (best-so-far) (path-length path-so-far ic-d-mat) ) 
(pause 50) 

path-so-far) 
(else 

(pause 50) 

(iterate loop ((paths '()) (next-city 0))

(cond ((fx>= next-city n-cities) 
(pause 50) 
(select-best-path paths ic-d-mat)) 
((city-part-of-path next-city path-so-far) 


(pause 50) 
(loop paths (fx+ next-city 1))) 
(else 


(let ((new-path (copy-path path-so-far))) 
(set (path-element new-path step-index) next-city) 
(pause 200) 
(loop (cons 
(future 
(find-shortest-path new-path 
(fx+ step-index 1) 
n-cities 
ic-d-mat 
cities)) 
paths) 
(fx+ next-city 1))))))))) 




(define (select-best-path paths ic-d-mat) 
(iterate loop ((best-path '#f) (best-path-length '#f) (paths paths))
(if (null? paths) 
best-path 
(let ((candidate (touch (car paths)))) 
(cond ((null? candidate) 

(pause 50) 

(loop best-path best-path-length (cdr paths) )) 

((null? best-path-length) 

(pause 50) 

(loop candidate 
(path-length candidate ic-d-mat) 
(cdr paths) )) 

(else 

(let ((current-length 

(path-length candidate ic-d-mat))) 
(pause 100) 
(if (fx< current-length best-path-length) 

(loop candidate current-length (cdr paths) ) 
(loop best-path best-path-length (cdr paths)))) 

))))) 
;;;
;;; Best-so-far abstraction
;;;


(lset *best-so-far* '#f)


(define (init-best-so-far first-guess ic-d-mat cities) 
(let ((len (path-length first-guess ic-d-mat))) 
(message (format nil "Initial Best Path: ~s" 
(output-path first-guess cities) )) 
(message (format nil "Initial Best Len : ~s" len)) 


(set *best-so-far* (make-vector *N-Processors* len)))) 


(define-constant best-so-far 
(object (lambda () 
(vref *best-so-far* *my-pid*) ) 
((setter self) 
(lambda (len) 
(broadcast (row col) 
(when (fx< len (best-so-far)) 
(set (vref *best-so-far* *my-pid*) len))))))) 
;;;
;;; PATH abstraction
;;;


(define-constant path-element 
(object (lambda (path elt) 
(vref path elt)) 
((setter self) 
(lambda (path elt val) 
(set (vref path elt) val))))) 


(define (make-path n) 
(make-vector n '#f))


(define copy-path copy-vector) 


(define (city-part-of-path city path) 
(let ((len (path-steps path))) 
(iterate loop ((i 0)) 
(cond ((fx>= i len) '#f)
((eq? city (path-element path i)) '#t)
(else (loop (fx+ i 1))))))) 


(define (path-length path ic-d-mat) 
(let ((len (path-steps path))) 
(iterate loop ((step 1) (p-city (path-element path 0)) (sum 0)) 
(if (fx>= step len) 
sum 
(let ((current-city (path-element path step))) 
(if (null? current-city) 
sum 
(loop 
(fx+ step 1) 
current-city 
(fx+ sum 
(ic-dist p-city current-city ic-d-mat))))))))) 


(define-integrable (path-steps path) 
(vector-length path)) 


(define (output-path path cities) 
(cons path 
(iterate loop ((i 0) (coords '()))
(if (fx>= i (path-steps path)) 
(reverse coords) 
(loop (fx+ i 1) 
(cons (nth cities (path-element path i)) 
coords) ))))) 


(define-constant (ic-dist x y ic-mat) 
(vref (vref ic-mat x) y)) 
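The control flow of the solver is easier to follow in a sequential form. The sketch below mirrors the structure of the listing above (prune when the partial path is already no shorter than the best known length, otherwise extend the path by each unvisited city), but it is a hypothetical Python rendering: it runs the branches one after another instead of wrapping them in futures, keeps a single best-so-far value instead of one copy per processor, and omits the `pause` calls. The names `path_length` and `solve_tsp` are illustrative.

```python
def path_length(path, dist):
    """Length of the filled-in prefix of a path (None marks an unset step)."""
    total = 0
    for previous, current in zip(path, path[1:]):
        if current is None:
            break
        total += dist[previous][current]
    return total


def solve_tsp(dist):
    """Shortest open tour over all cities starting at city 0.

    `dist` is a square intercity distance matrix. Returns (path, length).
    """
    n = len(dist)
    first_guess = list(range(n))
    best = {"path": first_guess, "length": path_length(first_guess, dist)}

    def search(path, step):
        if path_length(path, dist) >= best["length"]:
            return  # bound: this branch cannot beat the best known path
        if step >= n:
            best["path"] = path[:]
            best["length"] = path_length(path, dist)
            return
        for city in range(n):
            if city in path[:step]:
                continue
            path[step] = city
            search(path, step + 1)
            path[step] = None

    search([0] + [None] * (n - 1), 1)
    return best["path"], best["length"]


if __name__ == "__main__":
    print(solve_tsp([[0, 5, 3], [5, 0, 4], [3, 4, 0]]))
```

As in the original, a good initial ordering of the cities tightens the first bound and prunes more of the search, which is presumably why the application is run with unordered, greedily ordered, and optimally ordered city lists.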
(herald cities)


;;;
;;; Lists of city coordinates (unordered).
;;;


(define unordered-cities
'(()
((-1 . 1))
((-2 . 0) (-2 . -1))
((-1 . 3) (3 . -1) (-1 . 0))
((-3 . -1) (-2 . 1) (0 . 2) (-1 . 4))
((-5 . -3) (-2 . -2) (-1 . 3) (1 . 2) (1 . 0))
((6 . 6) (6 . 1) (0 . 0) (2 . 0) (-4 . 1) (-6 . -6))
((-1 . 3) (-5 . 0) (1 . 5) (-2 . 1) (5 . -5) (-3 . 6) (3 . 6))
((4 . -1) (-6. 3) (-1. 4) (2. -5) (7 . 7) (3. -4) (8 . 2) 
(-1 . 6)) 
((1.. 7) (1. 4) (-5. 1) (6. -6) (-9 . -2) (1 . 4) (8 . -6) 
(5 . -2) (-4 . 9)) 
((2 . -8) (1. 0) (7 . -10) (-3 . -9) (-6 . 8) (-5 . -8) 
(2. -9) (-3 . 3) (1. 7) (10. 10)) 
((5 . -6) (-1 . -9) (-2.0) (1. -9) (8 . -9) (-2 . 6) (O . 10) 
(5.1) (7 . 1) (-9 . 9) (0 . 6)) 
((11 . 11) (-9 . 12) (8 . -3) (12 . -10) (5 . -4) (-11 . 9) 
(-1 . 3) (4.9) (-3 . -2) (-9 . 10) (-7 . 11) 
(-7 . 6)))) 


;;; Identical lists of city coordinates (ordered by greedy algorithm).
;;;


(define ordered-cities
'(()
((-1 . 1))
((-2 . 0) (-2 . -1))
((-1 . 3) (-1 . 0) (3 . -1))
((-3 . -1) (-2 . 1) (0 . 2) (-1 . 4))
((-5 . -3) (-2 . -2) (1 . 0) (1 . 2) (-1 . 3))
((6 . 6) (6 . 1) (2 . 0) (0 . 0) (-4 . 1) (-6 . -6))
((-1 . 3) (1 . 5) (3 . 6) (-3 . 6) (-2 . 1) (-5 . 0) (5 . -5))

((4 .. -1) (3. -4) (2. -5) (-1 . 4) (-1 . 6) (-6.. 3) (7. 7) 
(8 . 2)) 

(1.7) (2.4) GQ. 4 (-5 . 1) (9. -2) (-4 .9) (5 . -2) 
(6 . -6) (8 . -6)) 

    ((2 . -8) (2 . -9) (7 . -10) (-3 . -9) (-5 . -8) (1 . 0) (-3 . 3)
     (-6 . 8) (1 . 7) (10 . 10))
    ((5 . -6) (8 . -9) (1 . -9) (-1 . -9) (-2 . 0) (-2 . 6) (0 . 6)
     (0 . 10) (-9 . 9) (5 . 1) (7 . 1))
    ((11 . 11) (8 . -3) (5 . -4) (-3 . -2) (-1 . 3) (-4 . 9)
     (-7 . 11) (-9 . 12) (-9 . 10) (-11 . 9) (-7 . 6)
     (12 . -10))))
;;;
;;; Identical lists of city coordinates (optimal ordering).
;;;


(define opt-ordered-cities
  '(()
    ((-1 . 1))
    ((-2 . 0) (-2 . -1))
    ((-1 . 3) (-1 . 0) (3 . -1))
    ((-3 . -1) (-2 . 1) (0 . 2) (-1 . 4))
    ((-5 . -3) (-2 . -2) (1 . 0) (1 . 2) (-1 . 3))
    ((6 . 6) (6 . 1) (2 . 0) (0 . 0) (-4 . 1) (-6 . -6))
    ((-1 . 3) (1 . 5) (3 . 6) (-3 . 6) (-5 . 0) (-2 . 1) (5 . -5))
    ((4 . -1) (2 . -5) (3 . -4) (8 . 2) (7 . 7) (-1 . 4) (-1 . 6)
     (-6 . 3))
(C1. 7) (-4. 9) (-9 . -2) (5.1) 4.4) 1. 4) (. -2) 
(6. 36) (8° 2-67) 
    ((2 . -8) (7 . -10) (2 . -9) (-3 . -9) (-5 . -8) (1 . 0) (-3 . 3)
     (-6 . 8) (1 . 7) (10 . 10))
    ((5 . -6) (8 . -9) (1 . -9) (-1 . -9) (-2 . 0) (5 . 1) (7 . 1)
     (0 . 6) (-2 . 6) (0 . 10) (-9 . 9))
    ((11 . 11) (-4 . 9) (-7 . 11) (-9 . 12) (-9 . 10) (-11 . 9)
     (-7 . 6) (-1 . 3) (-3 . -2) (5 . -4) (8 . -3)
     (12 . -10))))
;;;
;;; Intercity distance matrices for unordered cities
;;;


(define unordered-ic-d-mats
  #(#()

    #(#(0))

    #(#(0 1)
      #(1 0))

    #(#(0 5 3)
      #(5 0 4)
      #(3 4 0))

    #(#(0 2 4 5)
      #(2 0 2 3)
      #(4 2 0 2)
      #(5 3 2 0))

    #(#(0 3 7 7 6)
      #(3 0 5 5 3)
      #(7 5 0 2 3)
      #(7 5 2 0 2)
      #(6 3 3 2 0))

    #(#( 0  5  8  7 11 16)
      #( 5  0  6  4 10 13)
      #( 8  6  0  2  4  8)
      #( 7  4  2  0  6 10)
      #(11 10  4  6  0  7)
      #(16 13  8 10  7  0))

    #(#( 0  5  2  2 10  3  5)
      #( 5  0  7  3 11  6 10)
      #( 2  7  0  5 10  4  2)
      #( 2  3  5  0  9  5  7)
      #(10 11 10  9  0 13 11)
      #( 3  6  4  5 13  0  6)
      #( 5 10  2  7 11  6  0))

    #(#( 0 10  7  4  8  3  5  8)
      #(10  0  5 11 13 11 14  5)
      #( 7  5  0  9  8  8  9  2)
      #( 4 11  9  0 13  1  9 11)
      #( 8 13  8 13  0 11  5  8)
      #( 3 11  8  1 11  0  7 10)
      #( 5 14  9  9  5  7  0  9)
      #( 8  5  2 11  8 10  9  0))
    ...
    0))))


;;;
;;; Intercity distance matrices for ordered cities
;;;


(define ordered-ic-d-mats
  #(#()

    #(#(0))

    #(#(0 1)
      #(1 0))

    #(#(0 3 5)
      #(3 0 4)
      #(5 4 0))

    #(#(0 2 4 5)
      #(2 0 2 3)
      #(4 2 0 2)
      #(5 3 2 0))

    #(#(0 3 6 7 7)
      #(3 0 3 5 5)
      #(6 3 0 2 3)
      #(7 5 2 0 2)
      #(7 5 3 2 0))

    #(#( 0  5  7  8 11 16)
      #( 5  0  4  6 10 13)
      #( 7  4  0  2  6 10)
      #( 8  6  2  0  4  8)
      #(11 10  6  4  0  7)
      #(16 13 10  8  7  0))

    #(#( 0  2  5  3  2  5 10)
      #( 2  0  2  4  5  7 10)
      #( 5  2  0  6  7 10 11)
      #( 3  4  6  0  5  6 13)
      #( 2  5  7  5  0  3  9)
      #( 5  7 10  6  3  0 11)
      #(10 10 11 13  9 11  0))

    #(#( 0  3  4  7  8 10  8  5)
      #( 3  0  1  8 10 11 11  7)
      #( 4  1  0  9 11 11 13  9)
      #( 7  8  9  0  2  5  8  9)
      #( 8 10 11  2  0  5  8  9)
      #(10 11 11  5  5  0 13 14)
      #( 8 11 13  8  8 13  0  5)
      #( 5  7  9  9  9 14  5  0))
;;;
;;; Intercity distance matrices for ordered cities
;;;


(define opt-ordered-ic-d-mats
  #(#()

    #(#(0))

    #(#(0 1)
      #(1 0))

    #(#(0 3 5)
      #(3 0 4)
      #(5 4 0))

    #(#(0 2 4 5)
      #(2 0 2 3)
      #(4 2 0 2)
      #(5 3 2 0))

    #(#(0 3 6 7 7)
      #(3 0 3 5 5)
      #(6 3 0 2 3)
      #(7 5 2 0 2)
      #(7 5 3 2 0))

    #(#( 0  5  7  8 11 16)
      #( 5  0  4  6 10 13)
      #( 7  4  0  2  6 10)
      #( 8  6  2  0  4  8)
      #(11 10  6  4  0  7)
      #(16 13 10  8  7  0))

    #(#( 0  2  5  3  5  2 10)
      #( 2  0  2  4  7  5 10)
      #( 5  2  0  6 10  7 11)
      #( 3  4  6  0  6  5 13)
      #( 5  7 10  6  0  3 11)
      #( 2  5  7  5  3  0  9)
      #(10 10 11 13 11  9  0))

    #(#( 0  4  3  5  8  7  8 10)
      #( 4  0  1  9 13  9 11 11)
      #( 3  1  0  7 11  8 10 11)
      #( 5  9  7  0  5  9  9 14)
      #( 8 13 11  5  0  8  8 13)
      #( 7  9  8  9  8  0  2  5)
      #( 8 11 10  9  8  2  0  5)
      #(10 11 11 14 13  5  5  0))
B.4 UNBAL


B.4.1 Dynamic


(herald generate-n-tasks)


(lset *tasks-remaining* 0)


(define (main n t)
  (set *task-cycles* t)
  (spawn-and-wait (lambda () (pause t)) n))


(define (spawn-and-wait thunk n)
  (set *tasks-remaining* n)
  (let* ((done (make-placeholder))
         (thunk-1 (lambda ()
                    (thunk)
                    (when (fx<= (modify *tasks-remaining*
                                        (lambda (x) (fx- x 1)))
                                0)
                      (determine done '#t)))))
    (iterate loop ((tasks '()) (i n))
      (if (fx> i 0)
          (loop (cons (make-dummy-task thunk-1) tasks) (fx- i 1))
          (sched-tasks (link-tasks tasks))))
    (touch done)))


(define (make-dummy-task thunk)
  (let ((new-task (make-task))
        (old-task (get-my-task)))
    (when old-task
      (set (task-level new-task) (fx+ (task-level old-task) 1)))
    (set (task-created-on new-task) *my-pid*)
    (set (task-closure new-task)
         (new-task-wrapper new-task thunk '()))
    (stats-creating-task)
    (task-message "Creating " new-task)
    new-task))


(define (link-tasks tasks)
  (if (null? tasks)
      '()
      (let ((first (car tasks)))
        (iterate loop ((current (car tasks)) (rest (cdr tasks)))
          (cond ((null? rest) first)
                (else
                 (set (task-next current) (car rest))
                 (loop (car rest) (cdr rest))))))))
B.4.2 Static


(herald generate-n-stat)


(lset *tasks-remaining* 0)


(define (main n t)
  (set *task-cycles* t)
  (spawn-and-wait (lambda () (pause t)) n))


(define (spawn-and-wait thunk n)
  (set *tasks-remaining* n)
  (let* ((done (make-placeholder))
         (n-local (fx/ (fx+ n (fx- *N-Processors* 1)) *N-Processors*))
         (thunk-1 (lambda ()
                    (thunk)
                    (when (fx<= (modify *tasks-remaining*
                                        (lambda (x) (fx- x 1)))
                                0)
                      (determine done '#t)))))
    (do-in-parallel (r c)
      (iterate loop ((tasks '()) (i n-local))
        (if (fx> i 0)
            (loop (cons (make-dummy-task thunk-1) tasks) (fx- i 1))
            (sched-tasks (link-tasks tasks)))))
    (touch done)))


(define (make-dummy-task thunk)
  (let ((new-task (make-task))
        (old-task (get-my-task)))
    (when old-task
      (set (task-level new-task) (fx+ (task-level old-task) 1)))
    (set (task-created-on new-task) *my-pid*)
    (set (task-closure new-task)
         (new-task-wrapper new-task thunk '()))
    (stats-creating-task)
    (task-message "Creating " new-task)
    new-task))


(define (link-tasks tasks)
  (if (null? tasks)
      '()
      (let ((first (car tasks)))
        (iterate loop ((current (car tasks)) (rest (cdr tasks)))
          (cond ((null? rest) first)
                (else
                 (set (task-next current) (car rest))
                 (loop (car rest) (cdr rest))))))))
B.5 MATMUL


The various versions of MATMUL were actually written as eight nearly identical programs.
In this section, we present the code that is common to those programs first. We then
separately list the code that contains differences.


B.5.1 Common Code


;;;
;;; timing parameters
;;;


(define-constant *loop-cycles* 7)
(define-constant *mul-cycles* 40)
(define-constant *add-cycles* 1)

(define-constant *matref-cycles* 7)
(define-constant *matset-cycles* 8)
(define-constant *lmatref-cycles* 10)
(define-constant *lmatset-cycles* 20)


(define-local-syntax (blocking-forpar header . body)
  (destructure (((name start end) header)
                (loop  (generate-symbol 'loop))
                (upper (generate-symbol 'upper))
                (mid   (generate-symbol 'mid))
                (pl    (generate-symbol 'placeholder)))
    `(iterate ,loop ((,name ,start) (,upper ,end))
       (cond ((fx> ,upper (fx+ ,name 1))
              (let* ((,mid (fx+ ,name (fx-ashr (fx- ,upper ,name) 1)))
                     (,pl (future (,loop ,name ,mid))))
                (,loop ,mid ,upper)
                (touch ,pl)))
             (else ,@body)))))
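The blocking-forpar macro in the common code splits its index range in half recursively: the lower half is handed to a future, the upper half runs in the current thread, and the parent then touches the future before returning. As a rough illustration only, the same control structure can be sketched in Python, with a `threading.Thread` standing in for `future` and `join` standing in for `touch`. The names `blocking_forpar`, `record`, and `results` are illustrative and do not appear in the original code.

```python
import threading

def blocking_forpar(start, end, body):
    """Call body(i) for each i in [start, end) by recursive halving.

    Like the macro, a range of one element (or fewer) runs the body
    directly with the index bound to `start`.
    """
    if end > start + 1:
        mid = start + ((end - start) >> 1)
        # Lower half runs in a new thread (the macro uses `future`).
        lower = threading.Thread(target=blocking_forpar,
                                 args=(start, mid, body))
        lower.start()
        # Upper half runs in the calling thread.
        blocking_forpar(mid, end, body)
        # Block until the lower half is done (the macro uses `touch`).
        lower.join()
    else:
        body(start)

# Usage: record which indices were visited.
results = []
lock = threading.Lock()

def record(i):
    with lock:
        results.append(i)

blocking_forpar(0, 8, record)
```

Because every parent waits for its lower half, the outermost call returns only after all iterations have finished, which is what makes the loop "blocking".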


;;;
;;; the following code is snarfed from dmatrix.t
;;;


(define-integrable (%my-ceiling x y)
  (fx/ (fx+ x (fx- y 1)) y))


;;; index calculation: more efficient to use FP.
(define (index->block+offset i blocksize)
  ;; returns (fx/ i blocksize), (fx-rem i blocksize) in 40 cycles.
  (let ((fl-i (fixnum->flonum i))
        (fl-blocksize (fixnum->flonum blocksize)))
    (let* ((quotient (flonum->fixnum (fl/ fl-i fl-blocksize)))
           (remainder (fx- i (flonum->fixnum
                              (fl* (fixnum->flonum quotient)
                                   fl-blocksize)))))
      (return quotient remainder))))


;;;
;;; %dmatrix data structure
;;;


(define-structure %dmatrix
  top-matrix
  submat-w
  submat-h)


(define (%make-dmatrix height width make-mat-fn val)
  (let* ((radix *Procs-Per-Dim*)
         (top-matrix (make-matrix radix radix))
         (submat-h (%my-ceiling height radix))
         (submat-w (%my-ceiling width radix)))
    (do-in-parallel (row col)
      (set (MATREF top-matrix row col)
           (make-mat-fn submat-h submat-w val)))
    (let ((dm (make-%dmatrix)))
      (set (%dmatrix-top-matrix dm) (CREATE-DIR-ENTRY top-matrix))
      (set (%dmatrix-submat-h dm) (CREATE-DIR-ENTRY submat-h))
      (set (%dmatrix-submat-w dm) (CREATE-DIR-ENTRY submat-w))
      dm)))
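The helpers above divide a height-by-width matrix into a radix-by-radix grid of sub-blocks sized by ceiling division, and map a global index to a (block, offset) pair. The following Python sketch shows that arithmetic under stated assumptions: it uses integer `divmod` instead of the floating-point computation in the listing, and the function names (`my_ceiling`, `index_to_block_offset`, `locate`) are illustrative rather than taken from the original code.

```python
def my_ceiling(x, y):
    """Ceiling of x / y for positive integers."""
    return (x + (y - 1)) // y

def index_to_block_offset(i, blocksize):
    """Return (i // blocksize, i % blocksize)."""
    return divmod(i, blocksize)

def locate(row, col, height, width, radix):
    """Map element (row, col) to ((block row, block col), (row offset, col offset))."""
    sub_h = my_ceiling(height, radix)   # rows per sub-block
    sub_w = my_ceiling(width, radix)    # columns per sub-block
    vblock, voffset = index_to_block_offset(row, sub_h)
    hblock, hoffset = index_to_block_offset(col, sub_w)
    return (vblock, hblock), (voffset, hoffset)
```

For a 10-by-10 matrix on a 4-by-4 processor grid, each sub-block is 3 by 3, so `locate(7, 2, 10, 10, 4)` places element (7, 2) in block (2, 0) at offset (1, 2).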
;;;
;;; MATRIX
;;;


(define (MAKE-MATRIX height width . val)
  (let ((matrix (make-vector height))
        (initval (if val (car val) 0)))
    (do ((row 0 (fx+ row 1)))
        ((fx= row height))
      (let ((vec (make-vector width)))
        (dotimes (i width)
          (set (vref vec i) (CREATE-DIR-ENTRY initval)))
        (set (vref matrix row) (CREATE-DIR-ENTRY vec))))
    matrix))


(define-constant MATREF
  (object (lambda (matrix row col)
            (DIR-READ (vref (DIR-READ (vref matrix row)) col)))
    ((setter self)
     (lambda (matrix row col value)
       (DIR-WRITE (vref (DIR-READ (vref matrix row)) col) value)))))


(define-integrable (MATRIX-HEIGHT m)
  (vector-length m))


(define-integrable (MATRIX-WIDTH m)
  (vector-length (DIR-READ (vref m 0))))
;;;
;;; LMATRIX (matrix of lstructs)
;;;


(define (MAKE-LMATRIX height width . val)
  (let ((matrix (make-vector height))
        (initval (if val (car val) 0)))
    (do ((row 0 (fx+ row 1)))
        ((fx= row height))
      (let ((vec (make-vector width)))
        (dotimes (i width)
          (let ((pl (make-placeholder)))
            (set (placeholder-determined? pl) '#t)
            (set (placeholder-value pl) initval)
            (set (vref vec i) (CREATE-DIR-ENTRY pl))))
        (set (vref matrix row) (CREATE-DIR-ENTRY vec))))
    matrix))


(define-constant LMATREF
  (object (lambda (matrix row col)
            (let* ((dir (vref (DIR-READ (vref matrix row)) col))
                   (lcell (DIR-READ dir))
                   (val (*lref lcell)))
              (DIR-WRITE dir lcell)
              val))
    ((setter self)
     (lambda (matrix row col value)
       (let* ((dir (vref (DIR-READ (vref matrix row)) col))
              (lcell (DIR-READ dir)))
         (*l-set lcell value)
         (DIR-WRITE dir lcell)
         value)))))


(define-integrable (LMATRIX-HEIGHT m)
  (vector-length m))


(define-integrable (LMATRIX-WIDTH m)
  (vector-length (DIR-READ (vref m 0))))
;;;
;;; Distributed matrix
;;;


(define (MAKE-DMATRIX height width . val)
  (%make-dmatrix height width MAKE-MATRIX (if val (car val) 0)))


(define-constant DMATREF
  (object (lambda (dmat row col)
            (let ((sub-h (DIR-READ (%dmatrix-submat-h dmat)))
                  (sub-w (DIR-READ (%dmatrix-submat-w dmat)))
                  (top-matrix (DIR-READ (%dmatrix-top-matrix dmat))))
              (receive (vblock voffset)
                       (index->block+offset row sub-h)
                (receive (hblock hoffset)
                         (index->block+offset col sub-w)
                  (MATREF (MATREF top-matrix vblock hblock)
                          voffset hoffset)))))
    ((setter self)
     (lambda (dmat row col value)
       (let ((sub-h (DIR-READ (%dmatrix-submat-h dmat)))
             (sub-w (DIR-READ (%dmatrix-submat-w dmat)))
             (top-matrix (DIR-READ (%dmatrix-top-matrix dmat))))
         (receive (vblock voffset)
                  (index->block+offset row sub-h)
           (receive (hblock hoffset)
                    (index->block+offset col sub-w)
             (set (MATREF (MATREF top-matrix vblock hblock)
                          voffset hoffset)
                  value))))))))


(define (DMATRIX-SUBMATRIX dm row col)
  (MATREF (DIR-READ (%dmatrix-top-matrix dm)) row col))
;;;
;;; Distributed l-matrix
;;;


(define (MAKE-DLMATRIX height width . val)
  (%make-dmatrix height width MAKE-LMATRIX (if val (car val) 0)))


(define-constant DLMATREF
  (object (lambda (dmat row col)
            (let ((sub-h (DIR-READ (%dmatrix-submat-h dmat)))
                  (sub-w (DIR-READ (%dmatrix-submat-w dmat)))
                  (top-matrix (DIR-READ (%dmatrix-top-matrix dmat))))
              (receive (vblock voffset)
                       (index->block+offset row sub-h)
                (receive (hblock hoffset)
                         (index->block+offset col sub-w)
                  (LMATREF (MATREF top-matrix vblock hblock)
                           voffset hoffset)))))
    ((setter self)
     (lambda (dmat row col value)
       (let ((sub-h (DIR-READ (%dmatrix-submat-h dmat)))
             (sub-w (DIR-READ (%dmatrix-submat-w dmat)))
             (top-matrix (DIR-READ (%dmatrix-top-matrix dmat))))
         (receive (vblock voffset) (index->block+offset row sub-h)
           (receive (hblock hoffset) (index->block+offset col sub-w)
             (set (LMATREF (MATREF top-matrix vblock hblock)
                           voffset hoffset)
                  value))))))))


(define (DLMATRIX-SUBMATRIX dm row col)
  (MATREF (DIR-READ (%dmatrix-top-matrix dm)) row col))
B.5.2 Cached Versions


These versions of MATMUL simulate coherent full-mapped directories. The code that simulates
the cache operations kept coherent by those directories is given here.
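As a rough guide to the listing in this section, the state kept per directory entry can be sketched in Python: an entry is held either by a single writer or by a set of readers, and each access either hits or forces a state change. This is a simplified sketch under stated assumptions, not the original code. It collapses the list and bit-vector reader representations into one Python set, omits all timing, and the class name, method names, and event strings are illustrative.

```python
class DirEntry:
    """One simulated directory entry: a single writer or a set of readers."""

    def __init__(self, home_pid, value):
        self.home_pid = home_pid
        self.writer = home_pid   # the creating node starts with write permission
        self.readers = set()     # empty while a writer holds the entry
        self.value = value

    def read(self, pid):
        """Return (value, event); event names the protocol case taken."""
        if pid == self.writer or pid in self.readers:
            return self.value, "hit"
        if self.writer is not None:
            # Another node holds write permission: it is revoked and the
            # requester becomes the only reader.
            self.writer = None
            self.readers = {pid}
            return self.value, "read-from-write-state"
        self.readers.add(pid)
        return self.value, "read-from-read-state"

    def write(self, pid, value):
        """Store value and return the protocol case taken."""
        if pid == self.writer:
            event = "hit"
        elif self.writer is not None:
            event = "write-from-write-state"
        else:
            event = "write-from-read-state"   # every reader is invalidated
        self.writer = pid
        self.readers = set()
        self.value = value
        return event
```

In the listing, each non-hit case additionally pauses for the message transit and processing times of the corresponding protocol exchange.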


;;;
;;; Coherence protocol constants
;;;


(define-constant *rreq-msg-size* 8)
(define-constant *rresp-msg-size* 24)
(define-constant *wreq-msg-size* 8)
(define-constant *wresp-msg-size* 24)
(define-constant *invr-msg-size* 8)
(define-constant *invr-ack-msg-size* 8)
(define-constant *invw-msg-size* 8)
(define-constant *update-msg-size* 24)

(define-constant *process-rreq-cycles* 4)
(define-constant *process-rresp-cycles* 4)
(define-constant *process-wreq-cycles* 4)
(define-constant *process-wresp-cycles* 4)
(define-constant *process-invr-cycles* 4)
(define-constant *process-invr-ack-cycles* 4)
(define-constant *process-invw-cycles* 4)
(define-constant *process-update-cycles* 4)


;;;
;;; DIR-ENTRY abstraction
;;;


(define-structure DIR-ENTRY
  HOME-PID
  DIRECTORY   ;; write: [fixnum] pid
              ;; read:  [list] (len <pid> <pid> ...)
              ;;        [vector] #(<0 has permission>
              ;;                   <1 has permission> ...
  VALUE
  (((print self port)
    (format port
            "#{DIR-ENTRY (~s) <~s:~s> ~s}"
            (object-hash self)
            (DIR-ENTRY-HOME-PID self)
            (DIR-ENTRY-DIRECTORY self)
            (DIR-ENTRY-VALUE self)))))
(define-constant *max-directory-list-length* 16)


(define (CREATE-DIR-ENTRY val)
  (let ((de (make-dir-entry)))
    (set (dir-entry-home-pid de) *my-pid*)    ; home node is here
    (set (dir-entry-directory de) *my-pid*)   ; I get write permission
    (set (dir-entry-value de) val)
    de))


(define-integrable (DIR-READ x)
  (get-read-permission x)
  (DIR-ENTRY-VALUE x))


(define-integrable (DIR-WRITE x val)
  (get-write-permission x)
  (set (DIR-ENTRY-VALUE x) val))


(define-integrable (pid->dir-index pid)
  (fx-ashr pid 4))

(define-integrable (pid->dir-bit pid)
  (fx-ashl 1 (fx-and pid #xf)))

(define-integrable (I-have-write-permission x)
  (has-write-permission *my-pid* x))

(define-integrable (has-write-permission pid x)
  (let ((dir (dir-entry-directory x)))
    (and (fixnum? dir) (fx= dir pid))))


(define-integrable (add-write-permission pid x)
  (set (dir-entry-directory x) pid))
(define-integrable (I-have-read-permission x)
  (let ((dir (dir-entry-directory x)))
    (or (and (list? dir) (has-read-permission-list *my-pid* x))
        (and (vector? dir) (has-read-permission-vector *my-pid* x))
        (I-have-write-permission x))))


(define-integrable (has-read-permission-list pid x)
  (%has-read-permission-list pid (dir-entry-directory x)))


(define-integrable (%has-read-permission-list pid l)
  (memq? pid (cdr l)))


(define-integrable (has-read-permission-vector pid x)
  (%has-read-permission-vector pid (dir-entry-directory x)))


(define-integrable (%has-read-permission-vector pid vec)
  (fxn= (fx-and (vref vec (pid->dir-index pid))
                (pid->dir-bit pid))
        0))


(define-integrable (add-read-permission-list pid x)
  (let ((l (dir-entry-directory x)))
    (set (cdr l) (cons pid (cdr l)))
    (set (car l) (fx+ (car l) 1))))


(define-integrable (add-read-permission-vector pid x)
  (let ((vec (dir-entry-directory x))
        (index (pid->dir-index pid)))
    (set (vref vec index)
         (fx-ior (vref vec index) (pid->dir-bit pid)))))
(define-integrable (get-read-permission x)
  (cond ((I-have-read-permission x))
        ((fixnum? (dir-entry-directory x))          ; someone else has write
         (get-read-permission-from-write-state x))  ; permission
        ((vector? (dir-entry-directory x))
         (get-read-permission-from-read-state-vector x))
        (else
         (get-read-permission-from-read-state-list x))))


(define (get-read-permission-from-write-state x)
  (let ((fpid (dir-entry-directory x)))
    (set (dir-entry-directory x) (cons 1 (cons *my-pid* '())))
    (pause-read-from-write-time x fpid)))


(define (get-read-permission-from-read-state-vector x)
  (add-read-permission-vector *my-pid* x)
  (pause-read-from-read-time x))


(define (get-read-permission-from-read-state-list x)
  (cond ((fx>= (car (dir-entry-directory x))
               *max-directory-list-length*)
         (read-directory-list->vector x)
         (get-read-permission-from-read-state-vector x))
        (else
         (add-read-permission-list *my-pid* x)
         (pause-read-from-read-time x))))


(define (read-directory-list->vector x)
  (let ((l (dir-entry-directory x))
        (v (make-vector (fx-ashr *n-processors* 4))))
    (set (dir-entry-directory x) v)
    (dolist (pid (cdr l))
      (add-read-permission-vector pid x))))


(define (pause-read-from-read-time x)
  (let ((hpid (dir-entry-home-pid x)))
    (pause (+ (transit-time *rreq-msg-size* *my-pid* hpid)
              *process-rreq-cycles*
              (transit-time *rresp-msg-size* hpid *my-pid*)
              *process-rresp-cycles*))))


(define (pause-read-from-write-time x fpid)
  (let ((hpid (dir-entry-home-pid x)))
    (pause (+ (transit-time *rreq-msg-size* *my-pid* hpid)
              *process-rreq-cycles*
              (transit-time *invw-msg-size* hpid fpid)
              *process-invw-cycles*
              (transit-time *update-msg-size* fpid hpid)
              *process-update-cycles*
              (transit-time *rresp-msg-size* hpid *my-pid*)
              *process-rresp-cycles*))))
(define-integrable (get-write-permission x)
  (cond ((I-have-write-permission x))
        ((fixnum? (dir-entry-directory x))           ; someone else has write
         (get-write-permission-from-write-state x))  ; permission
        ((vector? (dir-entry-directory x))
         (get-write-permission-from-read-state-vector x))
        (else
         (get-write-permission-from-read-state-list x))))


(define (get-write-permission-from-write-state x)
  (let ((fpid (dir-entry-directory x)))
    (add-write-permission *my-pid* x)
    (pause-write-from-write-time x fpid)))


(define (get-write-permission-from-read-state-vector x)
  (let ((dir (dir-entry-directory x)))
    (add-write-permission *my-pid* x)
    (pause-write-from-read-time-vector x dir)))


(define (get-write-permission-from-read-state-list x)
  (let ((dir (dir-entry-directory x)))
    (add-write-permission *my-pid* x)
    (pause-write-from-read-time-list x dir)))
(define (pause-write-from-write-time x fpid)
  (let ((hpid (dir-entry-home-pid x)))
    (pause (+ (transit-time *wreq-msg-size* *my-pid* hpid)
              *process-wreq-cycles*
              (transit-time *invw-msg-size* hpid fpid)
              *process-invw-cycles*
              (transit-time *update-msg-size* fpid hpid)
              *process-update-cycles*
              (transit-time *wresp-msg-size* hpid *my-pid*)
              *process-wresp-cycles*))))


(define-integrable (pause-write-from-read-time-vector x vec)
  (pause-write-from-read-time x vec %has-read-permission-vector))


(define-integrable (pause-write-from-read-time-list x l)
  (pause-write-from-read-time x l %has-read-permission-list))


(define (pause-write-from-read-time x dir permission-fn)
  (let ((hpid (dir-entry-home-pid x)))
    (pause (+ (transit-time *wreq-msg-size* *my-pid* hpid)
              *process-wreq-cycles*
              (transit-time *wresp-msg-size* hpid *my-pid*)
              *process-wresp-cycles*))
    (iterate loop ((pid 0) (n 0) (max-dist 0) (max-pid hpid))
      (cond ((fx>= pid *N-Processors*)
             (pause (+ (transit-time *invr-msg-size* hpid max-pid)
                       *process-invr-cycles*
                       (transit-time *invr-ack-msg-size* max-pid hpid)
                       *process-invr-ack-cycles*
                       (* (fx- n 1) *invr-msg-size*))))
            ((permission-fn pid dir)
             (let ((dist (node-distance hpid pid)))
               (if (fx<= dist max-dist)
                   (loop (fx+ pid 1) (fx+ n 1) max-dist max-pid)
                   (loop (fx+ pid 1) (fx+ n 1) dist pid))))
            (else
             (loop (fx+ pid 1) n max-dist max-pid))))))


Coarse-Grained, Dynamic, With Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (blocking-forpar (i 0 n)
      (blocking-forpar (j 0 n)
        (matmul-row-col i j m1 m2 m3 n)))))


(define (matmul-row-col i j m1 m2 m3 n)
  (for (k 0 n)
    (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                   (DMATRIX-SUBMATRIX m2 k j)
                   (DMATRIX-SUBMATRIX m3 i j))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (MATREF z i j)
             (fx+ (MATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *matref-cycles* *add-cycles* *matset-cycles*))))))
Coarse-Grained, Static, With Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (do-in-parallel (i j)
      (for (k 0 n)
        (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                       (DMATRIX-SUBMATRIX m2 k j)
                       (DMATRIX-SUBMATRIX m3 i j))))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (MATREF z i j)
             (fx+ (MATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *matref-cycles* *add-cycles* *matset-cycles*))))))
Fine-Grained, Dynamic, With Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DLMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (blocking-forpar (i 0 n)
      (blocking-forpar (j 0 n)
        (blocking-forpar (k 0 n)
          (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                         (DMATRIX-SUBMATRIX m2 k j)
                         (DLMATRIX-SUBMATRIX m3 i j)))))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (LMATREF z i j)
             (fx+ (LMATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *lmatref-cycles* *add-cycles* *lmatset-cycles*))))))
Fine-Grained, Static, With Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DLMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (do-in-parallel (i j)
      (blocking-forpar (k 0 n)
        (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                       (DMATRIX-SUBMATRIX m2 k j)
                       (DLMATRIX-SUBMATRIX m3 i j))))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (LMATREF z i j)
             (fx+ (LMATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *lmatref-cycles* *add-cycles* *lmatset-cycles*))))))
B.5.3 Uncached Versions


These versions of MATMUL do not simulate caches for global memory. The caching code 
given above is replaced by the following macros. 
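Without caches, every access to a shared structure pays the full request and response cost, however often the same location is touched. The Python sketch below illustrates the read cost under loudly stated assumptions: the message sizes and processing cycles are the constants listed in B.5.2, but `transit_time` is a hypothetical stand-in for the simulator's transit-time function (here one cycle per hop plus one per unit of message size), and the function names are illustrative.

```python
# Constants taken from the coherence protocol constants in B.5.2.
RREQ_MSG_SIZE = 8
RRESP_MSG_SIZE = 24
PROCESS_RREQ_CYCLES = 4
PROCESS_RRESP_CYCLES = 4

def transit_time(msg_size, hops):
    """Hypothetical network model, NOT the simulator's: hops + message size."""
    return hops + msg_size

def uncached_read_cycles(hops_to_home):
    """Cycles charged for one uncached read of a shared location."""
    if hops_to_home == 0:
        # Local home node: only the processing costs are charged.
        return PROCESS_RREQ_CYCLES + PROCESS_RRESP_CYCLES
    return (transit_time(RREQ_MSG_SIZE, hops_to_home)
            + PROCESS_RREQ_CYCLES
            + transit_time(RRESP_MSG_SIZE, hops_to_home)
            + PROCESS_RRESP_CYCLES)
```

Under this assumed model, reading the same remote location k times costs k times `uncached_read_cycles(hops)`, whereas the cached versions charge the network cost only when permission has to be acquired.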


;;;
;;; No Coherent Caches -- reads and writes of shared structures always
;;; generate reads and writes over the network.
;;;


(define-local-syntax (DIR-READ x)
  (let ((data (generate-symbol 'data))
        (pid (generate-symbol 'pid))
        (rval (generate-symbol 'rval)))
    `(let* ((,data ,x)
            (,pid (DIR-ENTRY-HOME-PID ,data))
            (,rval (cond ((fx= ,pid *my-pid*)
                          (pause *process-rreq-cycles*)
                          (DIR-ENTRY-VALUE ,data))
                         (else
                          (remote-access
                           ,pid
                           *rreq-msg-size*
                           *process-rreq-cycles*
                           *rresp-msg-size*
                           (DIR-ENTRY-VALUE ,data))))))
       (pause *process-rresp-cycles*)
       ,rval)))


(define-local-syntax (DIR-WRITE x val)
  (let ((data (generate-symbol 'data))
        (pid (generate-symbol 'pid))
        (wval (generate-symbol 'wval)))
    `(let* ((,data ,x)
            (,pid (DIR-ENTRY-HOME-PID ,data))
            (,wval (cond ((fx= ,pid *my-pid*)
                          (pause *process-wreq-cycles*)
                          (set (DIR-ENTRY-VALUE ,data) ,val))
                         (else
                          (remote-access
                           ,pid
                           *wreq-msg-size*
                           *process-wreq-cycles*
                           *wresp-msg-size*
                           (set (DIR-ENTRY-VALUE ,data) ,val))))))
       (pause *process-wresp-cycles*)
       ,wval)))
Coarse-Grained, Dynamic, No Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (blocking-forpar (i 0 n)
      (blocking-forpar (j 0 n)
        (matmul-row-col i j m1 m2 m3 n)))))


(define (matmul-row-col i j m1 m2 m3 n)
  (for (k 0 n)
    (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                   (DMATRIX-SUBMATRIX m2 k j)
                   (DMATRIX-SUBMATRIX m3 i j))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (MATREF z i j)
             (fx+ (MATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *matref-cycles* *add-cycles* *matset-cycles*))))))
Coarse-Grained, Static, No Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (do-in-parallel (i j)
      (for (k 0 n)
        (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                       (DMATRIX-SUBMATRIX m2 k j)
                       (DMATRIX-SUBMATRIX m3 i j))))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (MATREF z i j)
             (fx+ (MATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *matref-cycles* *add-cycles* *matset-cycles*))))))
Fine-Grained, Dynamic, No Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DLMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (blocking-forpar (i 0 n)
      (blocking-forpar (j 0 n)
        (blocking-forpar (k 0 n)
          (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                         (DMATRIX-SUBMATRIX m2 k j)
                         (DLMATRIX-SUBMATRIX m3 i j)))))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (LMATREF z i j)
             (fx+ (LMATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *lmatref-cycles* *add-cycles* *lmatset-cycles*))))))
Fine-Grained, Static, No Caches


(define (main i j k)
  (let ((m1 (MAKE-DMATRIX i j 1))
        (m2 (MAKE-DMATRIX j k 2))
        (m3 (MAKE-DLMATRIX i k 0)))
    (message "finished initialization!")
    (matmul m1 m2 m3)
    m3))


(define (matmul m1 m2 m3)
  (let ((n *Procs-Per-Dim*))
    (do-in-parallel (i j)
      (blocking-forpar (k 0 n)
        (matmul-blocks (DMATRIX-SUBMATRIX m1 i k)
                       (DMATRIX-SUBMATRIX m2 k j)
                       (DLMATRIX-SUBMATRIX m3 i j))))))


(define (matmul-blocks x y z)
  (let ((x-height (MATRIX-HEIGHT x))
        (x-width (MATRIX-WIDTH x))
        (y-width (MATRIX-WIDTH y)))
    (dotimes (i x-height)
      (pause *loop-cycles*)
      (dotimes (j y-width)
        (pause *loop-cycles*)
        (set (LMATREF z i j)
             (fx+ (LMATREF z i j)
                  (acc (k 0 x-width)
                    (pause (+ *mul-cycles* *add-cycles*
                              *loop-cycles*))
                    (fx* (MATREF x i k) (MATREF y k j)))))
        (pause (+ *lmatref-cycles* *add-cycles* *lmatset-cycles*))))))
Appendix C


Raw Data 


The following tables list the running times for the various applications run under the various 
thread managers for the machine sizes and network speeds given. 


C.1 AQ 


C.1.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Ideal [[ 435035 | 112062 | 3f129 | Is0s7 | 14005 | 14005 | 14005 | 14006 | 
[P-ldeal_—_[[ 437191 | 118076 | 36085 | 18346 [16616 | 16027 | 16827 | 16627 | 
[ C-ldeal-1__|[ 437209 _| 115735 | 35832 | 18254 | 16te7_| 16970 | 1730d | 17432 | 
[ C-ldeal-2__|[ 437209 | 114419 | 36058 | 16989 | 16942 | 16875 | 17279 | THT | 
PRR _|[ 437191 [ar7456 [aoa | 19950_[ 17es4_| 18007 | 18604 | —— | 


PRR-2 | 37191 | rea0r | araer_[ 2135 | 1r9s2_| 188d | 18539 | —— | 
P Die | sr19 [126782 | 49405 | 44363 | 45505 | dors | | ——| 
pie || e119 | 136875 | 46455 | 33086 [soir | s40r9 | _—-| ——| 
PTTM || 4s7191_| 119280 | 39386 [21287 | 18493_[ 18410 | TSa85 | 20405 | 
PXTM___|[-asrr91_[ 116045 | 40749_[ 22085 | 25064 [24280 | 21566 | 21999 | 
PXTM-C__|[-asri91 [121966 | 47901 [26724 [31433 | 32781 | 29542 [29738 | 


Table C.1: AQ(0.5) — Running Times (cycles). 
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
PFree-Ideal [2131170 | sized | 141880 | 4030 | 1926 | T0149 | T0149 | 10149 
[P-ideal | 2idi7sd_| s0a40_| 150038 | as283 | 22013 | 1950 | 1943 | 10403 
[Celdeal-t fp 21ar7r2 | sizrid | 1as902 | ari39_| 93980] 19464 | 20100 | 20405 | 
[C-Tdeal-2_|[ 241772 | s42502 | Tasast_| qoTOd_| —2om0m | ToGTR | 20TRe | DOROI | 
PRR |[21arts4 | sasarr_[ sodas | saere | 2722 | 2eae | 37506 | ——| 


[2rrrst_| ia633_[ 149842 [52986 [33086 | 30014 | 26058 |__—— | 
DiI |[ 2180964 [592922 | 174096 | 89890_| T00RI0 | T0086 | —-- | _——| 
PDik2 || 2180964 | c00Ks5 | Irsad5 | os0e7 | sonia | os0e¢ [_--| ——| 
PTTM | 214i754_[s51045 [156386 | 53321 | 28386 | 24167 | 25013 | 25497 | 
PXTM___|[ 2141754 [550285 [T5232 | 54432 [30297 | 29041 | 27228 | 31580 | 
PXTM-C__|[21trarr_[ 560526 [158093 | 60323 | 54796 | foams | 72479 | 56160 | 


Table C.2: AQ(0.1) — Running Times (cycles). 


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
PFree-Ideal [| 4148500 | 10sqi0 | 270sia [T3080 [97464 [16150 | 16150 | 16140 | 
[P-tdeal—_[[41s0a34_[ 1o9s081_| oaare1_|—77a50 | —s0024 | 19828 | 1N901 | 19530 | 
C-ldeakt_|[-aiaoi36 | Tos0eds_| 374270 | —TReGO_| —soseT_| Tors] B00as_| BOAT | 
[C-ldeal-2 || aisoi36_| Toros0r | a7aisa | Tones | —s1gas_| ToR60_| 197s9_| SORTR | 
PRR-1 || disoioa_| tossras_[ ororsa | _soom1 | —dowoe | —as72 [aso | — | 
-RR-2___|[-aismion_| tosa012_|[ arene | sree | —a0ae | —aRD5 | dor0s | ——— | 
Ppitei | doaaaao_[iidrras | 310130_| 119951 | Tasoai | i220 |_| | 
[Laaaaa0_| tireroa_| 320553 | 109790] 101305 [8686 | —-] ——] 
PrrM |p aisoasa_| 1050390] 28002 [91730 | 38703 20160 | 25081 | 25057 | 
-xtM___[[aiooi0s | 1058502] 287006 | 87052 | 50737 | 33773 | 34173 | 33001 | 
PxTM-C_[[disoi0s | io76asr | s0s0e1 | 97568 | _o1e7 | —s7907 | oso2i | G07 | 


Table C.3: AQ(0.05) — Running Times (cycles). 


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Ideal [| 2010118 | 5092266 | 1292960 | 33877 | 92863 | S255 | 20138 | 20138 | 
[P-Ideal__|[ 20212387 [5208344 | 1346799 | a44a55 | 96475 | 96430 | 25419 [25018 | 
 CTdeal-T_|[ 20201079 [ 5059778 | 1282056 | 341638 | 98190 [38078 | 26236 | 26977 | 
[ CTdeal-2_ || 20201079 [ 5063978 | 1280377 | 333620 | 97802 | q0GTI_| 25788 | 26007 | 
PRR || 20201061_[ soe2s77 | as7iaa | aser9 | Ti867a | siads | sis55 | —— | 


PRR-2 || 20201061 [5062805 | 1286878 | s5224 [ Ti9TIs | _rrs22 | 7a7e2 | __—— | 
Die |[ 20562663 [ 5581705 | 1448165 | 410432 | 297RRO | BIeaT2 | | _—— | 
PDie2 || 20562663 | 5690009_[ 108596 | 406s21_| 185665 | Tresst | _——| ~~ | 
PTTM || 20207061 [ 5083434 [1323368 | s65086 | 111199 [55613 | 36051 | S7Oq0_| 
-XTM___|[ 20207061 [ 5080273 | 1309375 | s59148 | T2191 | G4ard | 47564 | 42096 | 
PXTM-C__|[ 20201061 [ 5iss50¢ [1363204 | sorr22 [158337 | si6rd | 80557 | 86459 | 


Table C.4: AQ(0.01) — Running Times (cycles). 
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Ideal [| 42819580 | 10853983 | 275388 | 702964 | 184603 | so967 | 25864 | 23075 | 
[P-Tdeal__|[ 43073918 [11291466 | 2856836 | 723780 | 191625 | 62044 [30202 | 25601 | 
[ C-Tdeal-1__|[ 43062610_[ 1orre279_| 2717726 _[ T1037? | 189768 |_o287s | 32043 | 26622 | 
[ C-ldeal-2__|[ 43008273 | 10rstor7_| 27is5e | 698536 | 189189 | 63505 | 32768 | 26708 | 
PRR-1 || 43062502 [ torsorrs | 2720015 | ressor | 223000_| T2508 | Tires |__—— | 

[43062502 [Tor92555 | 2ras316 | reass2 | 210191 | 104606 | 105919 |__—— | 


PRR-2 | 
Pitt |[- 43832400 | i1rssro4 | 3050293 | 80642 | 449619 | 41495 |---| _—~| 
Ppirt2 || 43832400 | tiz7ees | s201159 | 851806 | 276455 | 241591 | | __——| 
PT TM || 43062592 | toso2275 | 268857 | 76424 | 214560 | 84572 | s1s7s | A162 | 
-XTM____|[ 43062592 | 1os09656 | 2763683 | 731563 | 226962 | 105896 | 64571 | 62186 | 
xT M-C__|[ 23068255 | 10974037 [2877550 | 800739 | s1z924 [141977 | 100445 | 131108 | 


Table C.5: AQ(0.005) — Running Times (cycles). 


| p (number of processors) 
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
Free-Ideal_[[ 212501010 | Susile7s | 15000023 | s4dszs8 | S7a5e5 | 25107 | o9070 | DU8TT | 
P-Tdeal__|[ 213032152] ssod9007 | 14130614 | 3564005 | 900060 | 230586 | T1201 | 34875 
C-ldeal-1_|[ 213000518 | 53400304 | 13379388 | 3449093 | Srrdo5 | 238533 | Te261 | 38143 
[Crldeal-2 || 213615181] sa430059_| 13380000 | 337IROG | —RoATIT | 9387IS | TRAD | —FRHIA 
PRR-1 | 2rsq208n6 | saatoaso_[ iasas07d_| sizaTa2_| —o5i951_| aoaeT1_| weaaa |__| 
[_21s00500_[ sadae2e7_| Tadoeoas | ada7res | onaaTT_| sama | 3a0RT | ——] 


PRR2_| 
Di ___|[ 217a12192_[ sedos2r1 | i5120932_|388R689 | Tos0T20 | 916235 | —-| _—— | 

| 2tra40903_[-sorr2951_[ 15824024 | ar11792 | 1094066 | eto7as |__—-[_——| 
PTTM___|[ 21632152 | sad7o2s4 | is505753_| 3549928 | 950098 | 301909 | TISTTT | _G8726 | 
-XTM____|[-213626489_[53500740_| 13502808 | 3448766 | 932873 | 345939 | 157087 | 9614 | 
PXTM-C__|[2iseisi6s | sitreris | oeaase | s7aiser | Tiaidi9 | 426006 | 202013 | 15801 | 


Table C.6: AQ(0.001) — Running Times (cycles). 
C.1.2 Variable t_n


| p (number of processors) 

| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
Free-Ideal [| 2001185 | sooi0s | 10a001 | sasz88 | oass7 | sis2 | 20507 | 20537 | 
P-Tdeal__|[ 20212387 | szo0d53_| 1344079 | 340003 | 90621 | 0010 | 24814 | _2a7T5_ 
[C-Tdeal-i_ 20201007 | soso402 | 1989206 _| 334912 | 987s | 39590 | BORD | THO 
[C-ldeal-2 | s0oni007 | soeaiae | 127OT9_| seams | OTIaT | 410M] OTRTS |_2THRE | 
PRR-1 | 20201061 sosarie_[ 1s0ra70_| aes000_| Tasas0_|oe2a5_| 750 | ——_| 


PRR-2 || 20201061 | 5067324 | 1207604 | serr22 | 40735 | Tos37 | 97087 | —— | 
DiI | 20562131 | 5e7338_| Teaser | Moria | 325825 | aos6or | —--| ——| 
PDir2 | 20562131 | sode213 | 1636608 | A51022 | 250021 | 24727 | _—-|[ _——| 
PTTM || 20201061 | s0v9447 | ts24433_[ 362120 | 120822 [58782 | 3098s | 41860 | 
-XTM___|[ 20201061 | s0r9sa6 | Ta10847_[ s52478 | 126063 | 65980 | 53855 | 46578 | 
PXTM-C__|[ 20206724 [sisri7e | 1378281 | 399026 | 159428 [ rai? | 87160 | 127863 | 


Table C.7: AQ(0.01) - t_n = 2 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
| Free-Ideal | 20101211 | 5092227 | To0ds6l | sara» | 916d | sfa7s | 20743 | 20raS |
| P-Ideal | 20212425 | 5303939 | Ta45158 | 349025 | 98403 | 38643 | 25714 | 25778 |
| C-Ideal-1 | 20212477 | 5061063 | 1285544 | 336478 | 99937 | 40093 | 28199 | 29359 |
| C-Ideal-2 | 20207151 | 5065608 | 1283500 | 333864 | 98301 | 41170 | 27550 | 30351 |
| RR-1 | 20207113 | 5066859 | 1308516 | 376101 | 163580 | 122435 | 14460 | -- |
| RR-2 | 20201113 | sos5e51 | 1313859 | 375004 | 175093 | Ts0910 | 1etio7 | -- |
| Diff-1 | 20563652 | 6565006 | 1829308 | 528693 | s52823 | arrqes | -- | -- |
| Diff-2 | 20563652 | 6732366 | TaN449 | 524690 | 306026 | soma | -- | -- |
| TTM | 20201197 | 5085276 | 1330267 | 382041 | 126112 | 60308 | 44935 | 43057 |
| XTM | 20201235 | 5073495 | Ta09611 | a5aq7s | 120086 | 70082 | 47941 | 46875 |
| XTM-C | 20201153 | 5148725 | 1370046 | 396630 | 146105 | Sor7I | 84630 | 93370 |


Table C.8: AQ(0.01) - t_n = 4 Cycles / Flit-Hop - Running Times (cycles).
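The running times in these tables can be turned into speedup and efficiency figures with a few lines of arithmetic. The following is a minimal sketch, not part of the thesis: it assumes the eight columns correspond to p = 1, 4, 16, 64, 256, 1024, 4096 and 16384, and it uses the C-Ideal-2 row of Table C.8 as it is printed in this copy (individual digits may be affected by extraction errors). The names `PROCESSORS`, `C_IDEAL_2`, `speedup` and `efficiency` are illustrative only.

```python
# Assumed processor counts for the eight table columns (powers of 4).
PROCESSORS = [1, 4, 16, 64, 256, 1024, 4096, 16384]

# Running times in cycles, copied from the C-Ideal-2 row of Table C.8
# as printed in this copy of the appendix.
C_IDEAL_2 = [20207151, 5065608, 1283500, 333864, 98301, 41170, 27550, 30351]


def speedup(times):
    """Return T(1) / T(p) for every column, using the first column as T(1)."""
    base = times[0]
    return [base / t for t in times]


def efficiency(times, procs):
    """Return speedup divided by the processor count for every column."""
    return [s / p for s, p in zip(speedup(times), procs)]


if __name__ == "__main__":
    rows = zip(PROCESSORS, speedup(C_IDEAL_2), efficiency(C_IDEAL_2, PROCESSORS))
    for p, s, e in rows:
        print(f"p={p:>5}  speedup={s:9.1f}  efficiency={e:6.3f}")
```

With these inputs the speedup keeps growing up to p = 4096 and then drops slightly at p = 16384, because the printed running time at 16384 processors is larger than the one at 4096.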


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Tdeal [| 20101263 | s0s9i59 | 1296128 | ax6103 | 92870 | sa4d6 | 21D37 | 21037 | 
[P-Tdeal___|[ 20206850_[ 5305578 | 1347103 [346523 [98524 | 40262 [28300 | 271at | 
[C-Tdeal-1_ || 20201223_[ 5060382 | 1287232 [-aaedss | 101206 | 45203 | 3117S | 30517 | 
[ C-ldeal-2 || 20201223 | s0ss227 | 1287395 [330847 | 104544 | 43762 | 31813 | 38017 | 
PRR-1 [| 20201820_[ 5089136 | 1321956 | a397s4 | 226800 | 188330 | T9821 | ——| 

[2020191 [5100399 -| is3s196_[-s8a770_| 213222 | r98026 | 199654 | ——| 


PRR2 | 
DDife-1 || 20562577_[ 7935005 | 233ti66 [67266 | 440028 | ao8ee2 | —--[ _——| 
DDif-2 || 20568330_[ sosors3 | 2388092 [693360 | 478788 | aot68 | ——[ _——| 
PTTM || 20206986 [5093288 | 1346358 [378601 | 130401 | Tad97 | S401 | 67601 | 
-XTM___|[ 20201343 [5076740 _| 1309959359356 | 146317 | 72398 | 54588_[ 54303 | 
XT M-C__|[ 20201225 [ 5157686 | 1362555 [9967s | 149857 | 8729 | s2arT | 85041 | 


Table C.9: AQ(0.01) - t_n = 8 Cycles / Flit-Hop - Running Times (cycles).
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[Free-Tdeal [[ 20101367 | s0s7s82 | 1280253 | sa07e2 | 9sa77 | 35005 | 21432 | 
[P-Tdeal_—_|[ 20201349 [_sars67s_[ 1349120_[ 348855 | Too552 | 40071 |_28722 | 
[C-Tdeal-1__|[ 20201385_[ 5061883 | 1296083 [340302 | 105081 | 47603 | 38895 | 
[ C-Tdeal-2__|[ 20212711 [5073267 | 1287493 [342404 | 107550 | 49316 [A138 | 
PRR-1 || 20202985 [_sost770_[ 133048 [4m02R9 | 272499 | Br6916 | 254097 | 

[20201347_[si29123 [1353045 | 461425 | 317963 | 313609 | 304211 | 


PRR-2 | 

Dire || 20562599 | 10683379 |-ss33567 | 994517 | To25R2 | 687026 | —— | 
PDire2 || 20574141 | 10sa8a85 | st01989 | Tox6s99 | 685123 | 6546s | __—— | 
PTTM __|[ 20218124 | stores | isr4is2 | 423694 | Trsai9 | 93025 | 87635 | 
-XTM____|[- 20224039 | s101166 | 1323069 | 382214 | 180691 | 94803 | 84160 | 
[XT M-C__|[ 20207387 |~s162093 | 368166 [—a30721 | roar | 93938 [93656 | 


Table C.10: AQ(0.01) - t_n = 16 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors) 
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
PFree-Ideal [20107722 | s0oo48 | 1202733 | sa5e58 | oso52 | 30017 | 90175 | 
[P-Tdeal_—_[[ 20300380 | —s2T9082 | 1344262 | 354152 | 119200] —s9733 | 55886 
[-C-Tdeal-1 [| 20300533 | —sooedao | 1330288 | 388020 | 139250 | 83000 | 92458 
[Caldeal-2 || 20300553 | —srizide | taieze2 | 82504 | 144500 90079 | 115074 | 
CRR=1 || 2os00a35_[sor0rd0_| tso9019_[s0ReTO | —Gasz66 | —Ga42TO | GISOTS | 
| 2os00as5 | —saio8s2_[ 1550063 | —sOR50 | _5Os92_| _GBRITG_| GATGTT | 


PRR2 | 

PDiF-1____|[ 20801163 | 27373382 | 9347160 | 3095801 | PTIsBI | 2636 | —— | 
PDiF-2 || 20807163 | 27637925 | 950019 | 3103897 | 21768B6 | 212928 |___—— | 
PTTM___|[ 20307711 |_s163685 | 157252 | 464483 | 273088 | 207599 | 192060 | 
-XTM___|[-20306500_[ 209580 _| 1896477 |_AroaI3 | 256521 | 152687 | 158708 | 
-XTM-C__|[ 20300837 [_si94643 | 1464410 | 495743 [283201 | 263703 | 358852 | 


Table C.11: AQ(0.01) - t_n = 64 Cycles / Flit-Hop - Running Times (cycles).
C.2 FIB


C.2.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Ideal [[ 035387 | in7oss | 407s | eis | oil | sus | 5412 | sii | 
[P-deal___|[ 652439 | i1si509 | 0068 | i011 | _9000 | 7615 | r597 |_7615 | 
[ C-ldeal-1__|[ 652457 | ie8ar7_| 47300_| 17914 | 10660-9565 | T0007 | T060T | 
[ C-ldeal-2__|[ 652457 | 1esrid [avert | 18885 | 10564 [9907 [9701 | 10498 | 
PRR-1 [652439 | 168010 | s1950_| 23679 | 19527_| Tres2 | Te455 | —— | 


PRR-2___|[ 652439 | T69d7 | ria [23rd | 19366_[ 20220 | 20292 | —— | 
PDirt1___|[-s64r25 [187087 | s9471_| 40850 | 40976 | 40076 | --| ——| 
PDirk2__|[-s6dra5 | ras7s3 | 59769 | 29616 | 308m | 291r2 | —-| ——| 
PTTM || 652430 | rres2s | sosi1_[ 25182 | 18826 | 18883 | 19981 | T7028 | 
PXTM___|[-652030 | rr40ss | 4483_[ 23898 | 2097 | 1856 | 20204 | 17600 | 
PXTM-C__|[ 652439 [184850 | erre7_[ 33709 | 33980_[ s02t7 | 53639 | 57930 | 


Table C.12: FIB(15) — Running Times (cycles). 


| p (number of processors) 
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
Free-Ideal_[[ 70ss0s | 174573 | 0000 | 1001 [sits | 19100 | 771 | 7353 | 
P-Tdeal || r2daaia | 198825 | sireai_| 135793 | 30203-16100 | T0926 | 10432 | 
| C-ldeal-t_|[ r2t4a02 | ta10084 | ao1306 | 125020] 40977 | 19150] 148a1_| 15353 | 
| C-ideal-2 || r24aa02_| 1a2isi2 | 46670 | 126670 | 40058 | 18865 | 15906 | 15050 | 
PRR-1 | 7idaaa_[isiesoa | ariaio | 12426 | c2004 |_sariT | _siaon | | 
[_roaaaaa_[is20284| aeatad_[ aris [_Taiae |_sa001 | sari | 


Ppin-1 || 7373624 | r9srr24 | 526230_[ 160245 [T1990 | Teo | -- | ——| 

[7373624 20za501 | sa2ds2 | reai49 [ona [92985 [_—— [=] 
PTT | aa [saver | aoa [1677s [62914 [40510 | 29809 | 25988 | 
PxXTM___|[ 72a 1815089_[ as75et | 1o5t52 [76146 [44160 | 30408 | 34245 | 
DXTM-C__|[ 72444 [1909502 | s7ai92 [198382 [96755 [73182 | T5691 | 224980 | 


Table C.13: FIB(20) — Running Times (cycles). 
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[ Free-Ideal [| 78250282 | 19331725 | 4740547 | 1179706 | sooTer | So1s2 | 24842 | 11750 | 
[P-Ideal__| 80350904 | 21970404 | Soo7se1 | 1444026 | aoser7 | Too119 [33108 |_I7ir8 | 
| C-Tdeal-1_ || 80350922-[ 20095167 | 5036588 [1282851 | 337300 | _96459_ [37280 | 23350 | 
| C-Tdeal-2 || 80350922 [ 20106973 | so457OT_| 1280106 | 338657 | 9706 |_3R695 | 21431 | 
PRR-1 | 80350904 | 20096730_[-sos2r70_[ 1317302 | 406706 | 187035 | 178050 | __—— | 
PRR-2 | 80350904 | 20108332 | 5108362 | Ta58515 | 434966 | 2as167 | 2227001 | __—— | 


Ppirei___|[-sirrrse2 | 2is20a18 | sersirr | 1460149 | 4enod | 4a7e9 | -- | _——| 

[sirrrse2 | 22609164 [5959726 | 1562207 | 44102 | adso9s | _—-[ | 
PTTM || s0s50904 | 20141997 | si69924 | Ts92524 | 437969 | 160430 | Tata | aRI3R | 
-XTM____|[-s0350904 | 20rs2R93 | sras7od | 1390049 | 44279 | 20574 |_9I8 | T2473 | 
[XTM-C__|[-sos50904 | 2448i883 | 845312 | 1727970 | 609347 | 256368 | s99T27 | 453817 | 


Table C.14: FIB(25) — Running Times (cycles). 
C.2.2 Variable t_n


| p (number of processors) 

| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
PFree-Ideal_[[ 70ss0s2 | 17issi1 | dais | 10901 | 31905 | 11900 | 7559 | Tat | 
P-Tdeal || r2taaaa | 1980507 | sis016 | 135500] 40s | 1706 | 11760 _| T1202_| 
| Coldeal-i_|[ 724da02_[isiaso7 | ao4a96_| 120800 d0s05 | 21158 | 16490 | 18502 | 
[C-Tdeal2 | Toaaae2 | Tea0so1 | aea0Ts | Taro | teas 217 | TasaT_| —1TOTE | 
RRA Teaaaaa_[ exam | area | tastes | rapa | a0 | amas | | 


[rae _|a28d7s | 4rs125 | 150398 | 84345 [71391 | 75580 | —— | 
Dir ___|[-ra7368s_[ 2107838 [572836 | 1re2n | Tasa13 [Tosi | --| _——| 

[7373688 | 2167362 | 592608 [171304 | 116465 | 103883 |__——[_——| 
PTTM | 7244444 [1841130 | 491086 [158799 |_ortes [457d [_aTTas |_SOTIT | 
PXTM___|[r24aaad [18ert59[ 49asts [151804 |_oosi2 |_444e6 [34242 [30289 | 
PXTM-C__|[ 7244444 [r96r428 [597400 [20657 | Tos27s [86053 | 178461 | 171457 | 


Table C.15: FIB(20) - t_n = 2 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
Free-Ideal [| 705078 | 179182 | 432740 | 110665 | 31683 | T1952] 8003 | 7912 | 
[P-Ideal || 7244480_[ 2010908 | 523937 | 139389 | 41342 |_1r6sd | ToRD7 | 12977 | 
[ Cldeart_|[ 7244408 [1819160 | 464584 [128870 | 42968 [28543 | 20891 | 25006 | 
[ CrIdeal-2_|[ 7244408 [1821243 | d68578 [126515 | 43370 | 24341 | 20821 [25925 | 
PRR-1 | 7244¢70_[ 1824709 [ 474802 [15249 |_s7803 |_s0588 | 79072 | — | 
PRR-2 [7244470 [ 1837509 [ a80901_[Te1sti_| Ti957s | tio1s7_| 1267 | —— | 
DiI |[ 408018 [23rt267 | ori0s6 | 208648 | 160240 | Tes7ar | —--| _——| 
PDi-2 | 7408018 [2r24008 | o9atas [205738 | e446 [130195 | _——| _——| 
PTTM [7244562 [1837908 [497081 | T6s188 [69283 [4931 | B87aT | AGA | 
PXTM____|[r244572 [1821505 | 481610 | 150663 | 75886 [_SIo16 | 44460 [38472 | 
PXTM-C__|[7244990_[ 1948942 [558031 | 197515 | 123906 [118058 | 194132 | T7ORes | 


Table C.16: FIB(20) - t_n = 4 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Tdeal [| 7055130 | 1780261 | 4a3is9 [11205 | _srri9 | _ias7r | 8936 | 8872 | 
[P-Ideal__|[ 7244516 | 2019017 | sevr71 | 140016 [42754 |_20086 [T5818 [16647 | 
[ C-Tdealt_|[ 7244534 1820232 | 405783_[_ 133385 | Arr | 26280 | 26155 | 33170 | 
[ Crldeal-2 || 7214534 [1826416 | 467037 | 130588 | 48007 |_30072 |_27037 | 39954 | 
DRR-1 || 7244522 [1827965 | 490870_[ 179071 | 108505 | 113288 | Tos7a2 | __—— | 

| 72a4522_ | 1850326_[ 50874 | 2101s | 171235 | 165516 | 157364 |__—— | 


Dpitt=1 || 7373992 [2802551 [858206 | 281516 | 213001 | 208463 | --| _——| 

[7373992 [2917805 [875363 | 298488 | 240687 | 236879 |__—-|__——| 
PTT || 7244652 [1848180 | 524020 | 181886 [80810 |_o5572 |_7eTr | 70535 | 
PXTM___|[ 7244662 1836190_| 495480_[ 159742 [—Tosa8 |_s3176 [40310 | 47497 | 
XT M-C__|[ 7245441936986 | 60s019 | 210688 [119359 | T4T37s | 198389 | 270567 | 


Table C.17: FIB(20) - t_n = 8 Cycles / Flit-Hop - Running Times (cycles).
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[Free-Tdeal [| Toss34 | rrosois [452382 [17620 | sto | 14341 | 10598 | 
[P-Ideal__|[ 7244624 | T92190¢ [_so0rrs | 135957 [45492 [25199 [22033 | 
[ Grldeal't_|[ 7244642 [1824620 [—do0t2 | iasat7 [54192 | s4ad8 [42492 | 
[ C-ldeal-2 || 7244612 [1836309 [475337 | 138790 [56358 [ 41800 [56770 | 
PRR-1 [| 7244626 [1868163 [523080 | 212272 [159375 | 148379 | 153555 | 
PRR-2 || 7244626 [1934930 [620762 | 332189 | 275455 | 250806 | 222271 | 


Dpire1 || 7s7a786_[sesz7or | reisoss | a200e7 | aasri6 | 327626 | —— | 

[7373786_| 3902435 [1249959 | aosa7s [311929 [307490 | —— | 
PTT M || 7244850 [1878020 [S353 | 191020 [T1846 [95326 [T7295 | 
PxTM___|[7214850_[ 1855064 | sa7298 | 20207 | 103859 | raa19 | 6ordd 
PXTM-C__|[ 7214634 | r94ro0r | so0ad | 278499 | 305830 | 244921 [390895 | 


Table C.18: FIB(20) - t_n = 16 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[Free-Ideal || 7130049 | 1834009 | _4o7s88 | T8719 | 42126 | 21906 | 21585 | 
/P-Ideal__|[-7324486_[ 1865705 |_499085 |_15667 |__82806 | _S8301 | T7426 | 
[ Grldealt_|[ 7316384 [1853647 | 493737 [151475 | __79889 | 69600 | 97937 | 
 Crldeal-2_|[ 7316384 [1866739 | 502636 | 160024 [__B2982 | _ T6475 | 112934 | 
PRR-T || 7324510_[ 1958795 | 699563 | 420188 [352456 | 380028 | 345677 | 

[7324510 [2075863 | —e8aes1 | 140963 | 476833 | 492665 | 435985 | 


PRR2 

Diner || 7425235 | s097839_| aosaa27 | Ta1ra13 | T5688 [249922 | —— | 
PDin-2 || 7495235 | To0s9369 | as2q9a1_| 1921262 [1407s | 1132898 | —— 
PTT M [7325460 1905376 |_Grs0s9 | 284587 [179006 [198796 | 210055 | 
xT M || 7331310 [1868209 |_—s87s10 [229439 [156843 [128660 | 123495 | 


Table C.19: FIB(20) - t_n = 64 Cycles / Flit-Hop - Running Times (cycles).
C.3 TSP


C.3.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors) 

| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[FreeTdeal [[ sorais0 [ sisorr | Taasno [srorr | ros | Toro | Trois | 
PPoideal —|[s0a5s70_[ estar isrera [aries aosri [iar i190 
C-Tdeaki_|[ saa5n68 | SRO2R5 | T4ReIS | JoIa1 | 29205 | aas1_| TBO 
[C-Tdeal-2_|[ saa5ne8 | SoRdsR_[ 14Tssd_| A7eId | 20RR0 | T4800 | THOS 
PRR-1 | sass70_| s7oTes_| Ts250d_| 56002 | aa108 | 2RIeS | HOGIC 


PRR-2___|[-3635870_[ 590096 | 54367 | 55765 | B1167 | 20900_| 32736 | 
PDiti___|[ 3701646 | rors6 | iros23 | 73824 | 70035 | e872 |__| 
pie || 3701646 [622469 | irse49 | 62027 | 65226 | 61838 | —— | 
PTTM___|[-s635870_[ras756_| 156168 | 55840_| 29353 | 24160 | 207RT | 
PXTM___|[-s6s5870_[ 599244 | Taste [58582 | 39826 | 34023 | 30325 | 
PXTM-C__|[-ses5870_[ so5614 | irisss | 74057 | 50162 | 40802 | 49346 | 


Table C.20: TSP(8) — Running Times (cycles). 


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
Free-Ideal [| 12188860 | 2810166 | 600440 | 179553 | S018 | 19740 | 24925 | 
P-Ideal || 12613984 | 3313219 | 910147 | 222944 | 6170s | 22862 | 24671 | 
[ CIdeart_|[ 12614002_[ 3257010 |_s67r10_| 218874 | 60053 | 24028 | 2192 | 
 CrIdeal-2 || 12614002 | 3068803 [813505 | 222760 [sat | 24771 | 26176 | 
PRR-1 || 12613984 [3254456 [7752 | 237613 | _so07 | 64905 | 6ITI2 | 

[2613981 [ar2529 | s20ees | 243129 |_94aTT [59541 | 66103 | 


PRR2 | 
PDin-1 || 12839426_[3633033 | Tos80s1_| 259083 | T5109 | T5996 | —— | 
PDin-2 || 12839126_[ 3189390 | 966454 | 265024 [133908 | 135128 | —— | 
PTTM | 2613981 | aosr367 |_s620a1 | 226t16 | 80278 | 40875 | BISI5 | 
-xXTM___|[ 12613981 [3012289 | st5se8 | 272000 | s2588 | 60266 | 39302 | 
PXTM-C__|[ 12613981 | 382i205 [959277 [273300 | 125548 | 68i58 | 89273 | 


Table C.21: TSP(9) — Running Times (cycles). 
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[ Free-Ideal [[ 40284900 [9170942 | Disaoi2 | sirea | 1p919 | 42270 | 17978 | 
[P-Tdeal_—_|[-41708532_[ Tos0sr92_| 2605097 | 685780 | 184424 [51549 | 23096 | 
[C-Tdeal-1__|[-41708550_[ Tor98072_| 2463512 [635929 [171185 |_s2522 [25641 _| 
[ C-Tdeal-2__|[-41708550_[ 10248470_| 2457766 | 627366 | 167683 | _s4446 [26210 | 
PRR-1 || 41708532_[_ 9996082 [2532657 | eoro7d | 243249 | 141835 | 135856 | 
PRR-2 || 41708532_[ Toor7s95 [2547202 | ep rs6r | 222340 _| 109408 | 113071 | 


DDirk-1___|[ 42453008 | 1075891 | 2a284s2 | 733168 | s01146 | 283301 | —— | 
PDite2 || 42453008 | 11890537 | aB76274 | 762911 | 246536 | 244190 | ~~ | 
PTTM || 41rosss2 | rossorr1_| 2493742 | 668676 | 217039 | 78791 | 48531 | 
PXTM___|[-4rrosss2_| 10288864 | 2489987 | 66153 | 218532_| T1384 | 78960 _| 
xT M-C__|[arrossa2 | rosrosnt [2785668 [774362 | 293431 | 128278 | 158858 | 


Table C.22: TSP(10) — Running Times (cycles). 


| p (number of processors) 
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[Free-Ideal_ [| 3150a0750 | To02s806 | Liisoid7 | 3044050 | o1s0ad | Tora | 57253 | 
[P-Tdeal || 320054302] o7sea229_| 15875151 | 3913158 | 1009180 | 261798 | 85020 | 
C-ldeal-1_|[ 326031380] coso0e79 | 1aaes2i1 | 3640721 | 9sdi95 | 257s12 | _BB0TT | 
[Caldeal-2 | 320034380 | oo7s0003 | a4aG0eT | ssdaRGa | —oBIAI0 | BsPRB0 | STOR 
PRR=1 | s200st362_| aoa0sei0_| 1aasosed_[ so1atie_| T9dTs0_| GORTGO_| A775 
[Ls2s0s362_[ sososis0 | iadsami | as05015 | To0eda6_| 415116 | s25057 | 


PRR2_| 
DiI [331848858 | 80618967 | Is909t77 | Tiis066 | 1102629 [TIBI | —— | 
P Dia [331849858 | 92314330 | 16588280 | A5eK20 | T14s8I7 | 670868 | —— | 
PTTM____|[-326034362_| 75168863 | 17027508 | 4171091 | 1041265 | s82591 | TAR68R | 
PXTM____|[-326034362 | coss6o18 | 14520085 | 841879 | 1089472 | 450306 | 247845 | 
XTM-C___|[-326034362_[ 7as48t6s | 19289072 | qae2se8 | 450355 | 555820 | 377919 | 


Table C.23: TSP(11) - Running Times (cycles). 
C.3.2 Variable t_n


| p (number of processors) 

| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
PFree-Ideal [| 10281000 [| oironor | 2iss003 | simoa1 | Tiooi9 | 43519 | 100i | 
[P-Ideal [| 41708532 | Tosisiod_| 2823014 | o8a7a?_[ 177128 | 53554 | 29800 | 
[ Caideal-t [37208650 | soa8a08 | 227T9G1 | sADDRG | ISTIsD_| 51602 | —2TRaT | 
[CeTdeal-2 || 3726n056_| O1RITO0_| BaNsIaG_| STAI] TsGoOR | SORT [27502 
PRR-1 || aitons2_| ToxrToet_| saorrar_| araeai_| soweda | 179227 | TRH 


[41708532 | To286993_| 2608425 | G663625 | 258329 | 156383 | 149001 | 
Dirt |[ 42452026 -[ 11835760 | 306863 | 794801 | 328435 | B0791 | — — | 
PDin-2 || 49452026 | T295460_| 248099 | sti03 | S284 | 330177 | —— | 
PTTM || 41rosss2 | 1or20142 | 2597085 | 702539 | 218844 [|_91096 | 46015 | 
XTM___|[-4rross32 | 10210999 _| 2522599 | 665090 | a5r221 | 130872 | _T98TS | 
PXTM-C__|[-4rrosss2 [10523375 | 2799160 | 782442 | 209768 | 141508 | 128238 | 


Table C.24: TSP(10) - t_n = 2 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
| Free-Ideal | 40281926 | 9030180 | 2isdr74 | sti963 | 143008 | 41477 | 19069 |
| P-Ideal | 37268674 | 975a003 | 2602473 | e2ai67 | 164079 | 49003 | 24735 |
| C-Ideal-1 | 37268728 | 8930314 | 2312564 | 583627 | 159785 | 54946 | 28877 |
| C-Ideal-2 | 37268728 | 9T76764 | 2329364 | 568508 | 155500 | 53938 | 29785 |
| RR-1 | 37268680 | 9206744 | 2921696 | ors24d | 280451 | 226709 | B14TAD |
| RR-2 | 37268680 | 9218272 | 9358436 | 659935 | 283511 | 200450 | 186491 |
| Diff-1 | 38001678 | 12072635 | arra203 | 852180 | 383371 | 87760 | -- |
| Diff-2 | 38001678 | 12072548 | 3270647 | 870302 | 345004 | 338349 | -- |
| TTM | 37208756 | 9203473 | 2451967 | 6608s | 203593 | 91821 | SIB9T |
| XTM | 37268756 | 9235892 | 2936277 | eBse71 | 246558 | 117606 | 75326 |
| XTM-C | 37268740 | 9375229 | a5a827s | 717532 | 284910 | 128558 | 189255 |


Table C.25: TSP(10) - t_n = 4 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[Free-Tdeal [[ 40284978 | o211855 | 2244573 | sa5730 | 142092 | 449R5 | 20207 | 
[P-Tdeal__|[ 37268782_[_ 9825104 | 2620276 [629202 | To6593 | 50832 | 25689 | 
| C-ldeal-1__|[ 37268836 [8940971 [2254033 [581546 | Tro192 | 58026 | 33841 _| 
[ C-ldeal-2__|[ 37268836 [9515668 | 2337640 [571036 | Trovos |_57057 | 33069 | 
PRR-1___[[ 37208784 [_o194751_[ 235n986 [710130 | 338604 | 326150 | 312106 | 

[37268784 [9213020 [2326764 | 7asa83 | seas | 334363 | 257655 _| 


PRR2 
DDife1 || 37933850_[ 14963301 | 4036970_[ Troost | so7669 | s35733 | — — | 
[Dif-2 || 3793ae50_[ 14ros0r2 | ai4aa77_[ ti91087 | 480872 | i389 |__—— | 
PTTM ___[ sr26so18 [9219140 | 2372ae1 [662801 | 214460 |_ 95477 | 73486 | 
PXTM___|[a7268918 [9224555 [2349906 [616615 | 220696 | 96208 | 82083 | 
PX TM-C__[[-37268830_[ 9365410 | 2547s01_[_7es288 | 255896 | 126161 | 120770 | 


Table C.26: TSP(10) - t_n = 8 Cycles / Flit-Hop - Running Times (cycles).
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[Free-Tdeal [[ 38269172 | _s950024 | 2230310 | sas726 | 43442 | 43046 [21654 | 
[P-Tdeal___|[ 36262376 [9675295 | 2403593 [627066 | T4049 [S184 |_28741_| 
[ C-ldeal-1__|[ 36262430 [8941216 | 2261960_[ 594370 | TesoI [61959 | 11406 | 
| C-Tdeal-2 || 36262430_[ 9169046 [2288068 [—s7a868 | Trov2 [64228 |_A237F | 
PRR-1____|[ 36262370 [__s9534d2 [2298015 | TosI6s | 433603 | 387426 | 401201 | 

[-36262370-[ 9196652 [2382659 [823730 | sOl4ad | 434040 | 454685 | 


PRR-2 | 

Dirk || 36089381 | T2tri08 | ss91049 | Te016s4_| 879497 | 794900 | —— | 
PDirk2 || 3608931 | i96t2a18 | 6200907 | irrsroz_| s5i6ad | es57a7 | —— | 
PT TM || 36262602 | 9038395 | 28rs249 | 697930 | 287582 | 128701 | TOSI | 
-XTM || 36262602 [9288592 | 2383363 |_627707 | 236523 | 128507 |_S7II6 | 
DXTM-C__|[-36262409 [143246 | 2ioare | rraeis | 285475 | ree [117284 | 


Table C.27: TSP(10) - t_n = 16 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors) 
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
FFree-Ideal [[ 35747301 [| s0o1550 | Dis7500 | sisssd | Ido70T [40450 | 27100 | 
P-fdeal || 35125384 | 9015493 | 9440341 | o4126_| ira7i8 | —o7s0d | 48381 | 
C-ldeal-1_|[ 35139827 | 8138300 | 9233736 | 00085 | 192904 | T01322 | 0810 | 
[Caldeal-2 35130827 | erases | _aoaara4_| 619288 | 204783 | 120543 | TOITET_ 
PRRt__|[asiosaa_[—awoorae | osonaea | —oa7raa | 0006 | a7O950_| 67IUTD_ 
[Lssiasae2_|_o0TIaaa_[siosea1_| Tszssa1_| TaT061 | T60s72R | 9ATGTO | 


PRR2_| 

DiI |[-s57i28e1_[ 44713661 | Te2sssi1 | -aB70927 | 2490625 [2493885 | —— | 
[35712861 | aoos2e58 | I547sa47 | 4958644 | 2593371 | 2738879 | __—— | 

PTTM [35126186 [_90ses47 [2505409 | seer10 | 423920 | 260616 | 336035 | 

PXTM____|[-s5126186 [9027484 [2530795 | reo966 | 324328 | 262660 | 225875 | 

PXTM-C__|[-35125796 [9169307 [2645049 | s81426 | a51001 | 378891 | 386189 | 


Table C.28: TSP(10) - t_n = 64 Cycles / Flit-Hop - Running Times (cycles).
C.4 UNBAL


C.4.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Ideal [| 01423 | Is0ns2 | a7es6 | 9712 | 2791] 1193 | 10o1 | 1041 | 
PStat___|[-sora51_[asini6 | 38503 [tosis |aear [ori | —-[ --| 
[P-Ideal__|[ sora51_[ 1rer9 | 3975 | 11901 [14a [tas | -- | -—| 
[ Crldeal-1_|[ 601469 | 252191 | 3453 | 48208 | 49236 | 48179 |_—-[ ——| 
[ Cldeal-2_|[_s01469_| 150932 | 39609 [11790 [4650 [3082 [3215 [_Sa18 | 
sorts1 | 219990_| Sris7_| 48730 | 52629 [eos | —-| ——| 
Peorts1 | 151408 | 41001 | sear | os | 9968 | oer | | 
[612669 | 165953 | 49035 | 42006 | 41461 [arte | --| —~| 
[s12669-[174799_| ae8a1_| 38377 | sexe1 | sese1 | ——| ——| 


Table C.29: UNBAL(1024) — Running Times (cycles). 


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
PFree-Tdeal | 2404687 | 601368 | 150560 | a7888 | 9838 | 954 | T3387 | 1213 | 
PStat____|[ 2404715 [01982 | 151207 [sso | toss | 4573 | 4243 | —— | 
P-Ideal___|[ 2404715 [688383 | 173725 | 4472 | 4ases |---| --| -—| 
[ Gatdeal-t_|[- 2404733 [7005902_[ 331930 189530 | 19540 |---| —— 
| Cldeal-2_|[ 2404733 [~ 6orras | isar27 | t1090 | 1224 [aise | 3420 | 4158 
PRR-1__|[ 2404715_[~ 876055 [322985 | 190020 | 93eed |---| ——| 

[2t0rris | _coz221 | Tas512 [44082 [19187 | 19303 | 19827 [| —— | 


PRR2 | 

P Die] 2448707_[55806_[T75t43 | s2R03 | S281 | ware | --| —— 
Ppirt2 | 2448707 [_o1825 | 183335 | 77300 | 79073 | 78355 | —-| _——| 
PTTM | 240q71s | 60to55 [154410 | 4036 [_rrt50_| 10850 | Toad | 10680 
PXTM___|[ 2404715_[~603909_[ 168387 [50501 | 23673 | 1772 | 11159 | 10280 
PXTM-C__|[ 2404715 [oasis | i5s607 [54386 [3008 | 19281 | ras | 17765 | 


Table C.30: UNBAL(4096) — Running Times (cycles). 
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
PFree-Ideal [| so1?ri3 | 20de32 | oo1s76 | 15052 | 3801 | 9908 | 3098 | 1100 
Stat serrr7_|-240n196-| 02023151393 _| 39057 [i617 [045 | 7343 
 p-Ideal_—|[serrr7_| 2752767 | _ooaras_[ 175059 Tearer | -—|~ --] 
-C-deal1_|[-9617789-| 4021080_| 1921857 | Tadoas_| Tad70s_ | ——— |---| 
PC-tdeal-2 | 9617780] 2405012-| 904920 150373_| 42418 | 1268 [5056 | 5009 
PRR1 | ser7r7i_[-ssoa036 | TaaToid_| tanita | Ts07ei |---| ——| 
PRR-2———|[osrrrri_[ aansase | soriie| 150012 | s04a9-[—asis1_| 15700 ———| 
Pie ——|[-oroosss[-zorsssa_ | on0a67| 194927] Tasa19_[ Tonary_|——[———] 
Hpi ——][-oroosss_[ arsrasa_|—razast_| 192t00-] taasa7 | tsa2Ts_[ ——[———] 
PPEM——[osrrrri_[ aavss7s | sos0d7_| 156r20|_aars0_[ 180T9_| 13028 | Ta276| 
-xtM——|[serrrn_| 2aore72|—ssser9_| 176066 —60d98 | 32323 _[ 17090_| 14391 _ 
xT M-C—|Pserrrrt | 2a0s167 [011832 [-20s270-[ T7201 [40920 -[ 28618 | 26850 


Table C.31: UNBAL(16384) — Running Times (cycles). 


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Tdeal [| 38409967 | oorTos® | 2404610 | s0l40s | Iso7i® | seid | 101d2 | 3227 | 
[Stat____|[ 38469905 [9618252 | 2405287 [602209 | T5161 | 39793 | 13089 [9265 | 
[ P-Tdeal | 38469995 [11010295 [2772303 [699130 | esezs1 | --| --| ——| 
[ C-ldeal-1__|[ 38470013 | 16082583 | -s2846e7_| 3015695 | aoIs695 | —-—[--| ——| 
[ C-ldeal-2 || 38470013 [9618068 | 2109301 [609976 |_I60dI5 | 49098 | _Ta5I9 | 67oT | 
PRR-1___ | 38169905 | Tors619 | s1443a1_[ 3016122 | 3020815 | ——| 


PRR? | seionmns [sores [orisras [planes [Tron amos ae 
Ppie___|[-s9159316 | toMarri | 2699916 |_ro5rd6 | _s26n55 | s2rra7 | --| -—| 
Ppirt2__|[-s9759316 | 10836040 | 2Bsi729_|_ra5s33 | _sissr7 | srrear | __—-| _——| 
PTTM __|[-3sis9005 | 9e22751 | atoo16 |_eri49s [161829 | 48019 | 22966 | T6DTO | 
-XTM____|[-38469995 [9620153 | 2581413 | 663652 | 220172 | 86x45 | 31568 | 23082 | 
PXTM-C__|[-38469995 [9621468 | 2120705 [679842 [225614 | TOSI6O [BIST | 42882 | 


Table C.32: UNBAL(65536) — Running Times (cycles). 
C.4.2 Variable t_n


| p (number of processors) 

| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
PFree-Ideal [ so17ri3 | 2101639 | oois7r | 1s0se7 | se0d7 | 10027 | 3101 | 1430 ] 
[Stat [9017771] 2405193 [02025 | 151303 | 39057 | 11017 | o3a7 | 7028 | 
[ P-Ideal__|[ aoirrni_| orsa7o1_|_oaa7ai_| 175705 | tara | --|--| ——| 
[C-idealt_|[-oei77a_| dorisaa | taeid02_| Tada7s | TaaTes | _———| | ——] 
[Crldeal-2_|[-asrr7a9_| 2405040 | _sueTIs | Ta0as9 | a2102 | TNR _| Gao | OR | 
PRR] osrr77_| sssao7a | taaasa7_[ soea_| Taig |_| _—-] 
PRR2__|[osir77_| 2a0ssa1_|_sorrae_[157020_|_s0ea3 | —s0s20_| 51365 | ———] 
Piet _—|[-aroonss_| 27reine | Taia7a_[ oisiee_| Taawos | Teams | | | 
Ppite2——[[-9790833_[ 2920502 [779836] 212838 | Trevis | 17217 | _——| | 
Prem [serrrr | 2a0orr |_o1ides | 158732 | aeieo | 19116 | 15863 | 15170 | 
PxtM___[oorrrr [20007 [o1eis | 170934 | co080 | 30003 | 18385 | 14383 | 
PxTM-C__[[sorrrmi | 2408ae0 | o1asai | 183054 | rai [30057 | 30331 | 30000 | 


Table C.33: UNBAL(16384) - t_n = 2 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
[Free-Ideal [| 9617769 | 2404671 | 601403 | t0ss7 | s7997 | 9085 | 2991 | 1380 | 
PStat____|[ 96rrro7_[ 2405228 [602051 |_tidi9 [30203 | 11997 | 6919 | 8746 | 
P-Ideal__|[ 9617807_[ 2752863 | _6BRSI | Irov4T | torse7 | --| _--| -~| 
[ GrIdeart_|[ 9617825 | 4170649 | 421874 | 7oares | rodse7 | _--[_--| ——| 
[ Crtdear-2 [9617825 [2405126 [607960 19723 | DIT | TaTas [Bees | THE | 
PRR-1__|[ 901r7o7_[ 3924496 | 1820303 | 1092005 | S47 | —— | 
PRR2_| 


Dserrror Dames aorirt [rao aams [ones | oe | 
DiI |[-9790839-[s101269 | sezs72 | 255080 [221957 | BaI97s | --| ——| 
PDi-2 || 9790839 [ 3207796 | sTGR0 | 253320 [212636 | BIZ6TT | _——[ _—— | 
PTTM___|[ 96rre8o [2405702 [_6ro2 | 156291 | 48526 [ 23047 | 18998 | DIN8d | 
PXTM___|[-9617899[ 2407064 [612264 | triisi_| 70299 [_s0z71 | 19512 | 16630_| 
PXTM-C__|[96rrsi7_[ 2410904 [_s2tsis | 194se0 | eort7 | 38731 | 34593 | 30769 | 


Table C.34: UNBAL(16384) - t_n = 4 Cycles / Flit-Hop - Running Times (cycles).
| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 | 16384 |
FFree-Ideal | vorteo1 | oi0iris [corti | Isoss_[ —avoss | —oaTs [2027 | 1200 ] 
Stat arrars_[ 30200] —so2rri_|—taiseo_[—son93| —T2o7s [arts | T1008 | 
-P-Tdeal_—[[-sorrs61_| 9752909] soav29 [175050 | Tes09 | --] -- | | 
[C-Ideaki | sorrs07 | 4305503] 548553 [754050 | 55143 | ——--] —-- | ——] 
-C-Tdeak2—[[sorrsoT | 2i0n416 | G0at38 [159604 | —ana25 | —1Tost | 13000_| TTI | 
-RR-1 | 9or7a7s | 4514350] 2436029 | Te16026 | T5691 | --] --| -—| 
-RR-2 | sor7a7s | 2400931] coree7 | 162251 | _a5907 | e400] Sa197 | ———] 
Pier ——[[o7o0sa1_|-arsa5r2-| t1or8a9_| 334303 203790] oaa30 | _-—| ———] 
pinta |[-o7o0stt | saiaasa | t1s0623 | sated | ostaso | 2eazo7 [| ——[ 
PPrM——|[-oarre0r_[-s4r2s00|_soortr|—teats0_[—s7ert | _27a50_| a7arr | Bari 
-xTM_——[oarsorr_[s10ria2-|—siatox| —r60a52_[—rosas_| —soea1 30021 | 2521 | 
-XTM-C—[varrons_[-2110835 | —s2a021 [toasts [—rerr7 | —si200-[ sons | 5082 | 


Table C.35: UNBAL(16384) - t_n = 8 Cycles / Flit-Hop - Running Times (cycles).


| p (number of processors)
| Mgr | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
[Free-Ideal [[ 9617925 | 2404825 [ oolss5 | 150735 | _asoal | 9871 | 2927 ] 
[Stat____|[ 9618031 | 2405679 [—so2e13 [152267 | 40495 [13909 [8447 | 
[P-Tdeal || 9618023 2758261 [_o9t29_[1ress1 | estar [__--|_——| 
[G-ideal-t_|[-9618059_[d7assis [1791979 | _rs5qi1 |_Tess7r | __--| __——| 
[ C-Ideal-2__|[-9618059_| 2405902 | Go8i50_[ 163851 | s2190 | 21361 [20516 | 
PRR-1 || 9618031 | sas617d | 3382293 | Dase2id | 200Ke39 | -—|[_ ——| 
PRR-2 [9618031 [2408169 [_G12522 [175200 [Tisai | T3040 | T1300 | 
PDit-i [9790835 | s0sso12 | Ta78660 [491322 [436759 | 4aq702 [__ —— | 
[Dit-2 || 9790835 _[sirrsas | Toisss2 | 4srso1 _4is4ea | aivia3 | ——| 
PTTM [9618249 | 2ars018 | orzis9 [172037 [__s9a7r |_42768 | 36009 | 
xXTM___|[9618139_[ 2410285 [—e27809 [193619 [81576 [39756 [37971 | 
[XTM-C__|[ 9618139 [2410051 [_e20739 [193416 [604d [44412 [39689 | 


Table C.36: UNBAL(16384) — t_n = 16 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
Met, 4, SSS C—O 
[Free-Ideal [| 9618197 | 2405385 | 602123 | Isis07 | 38003 | 10127 | 3383 | 
[Stat____|| 9618908 [2407995 | 05815 | 15009 |__45443 | _22059 | 28033 | 
 P-Ideal___|[ 9618995 [2754645 | 96277 |_1ro861 |_Tess21 | --|__——| 
[ Grldeal-t_|[-9619031_[ 6141369 | 2867639 | ToR6767 |_T9677 | __--| _——| 
[ Crldeal-2_|[ 9619031 | 2409300 | 620783 |_187209 |__Ti467 | _S07ST | T4050 
PRR-1___|[ 9618993 [716640 | 6079406 | 5169356 | TosoTE7 | __--|__—— 

[9618093 |2a18036 | e5a437 [233753 | 218088 | DTDs | DIOR | 


PRR2 

Diner || 9791855 T2srtre1_| 4a9a430_| 1433882 | Toraaa7 [273225 | —— | 
PDir-2 || 9791855 | Tais7is9_| 4arisr2 | 1418808 | 1295454 | ears | —— 
PTTM || 9619797 | 2430299 | es7ro1 | 215213 |_ 12762 [118838 | T2246 
-XTM___|[ 9619511 [2430879 | ea4e48 | 2aa714 [124675 | 85494 | 90431 | 
XTM-C__[[ 9619511 [2426418 |_erorr7 [219719 [_to7ss1 | 89765 | 95154 | 


Table C.37: UNBAL(16384) — t_n = 64 Cycles / Flit-Hop — Running Times (cycles).
C.5 MATMUL: Coarse, Cached


C.5.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors)
[MeL i, 4[ | 


T6587 
202748 | 73246 14552 
26328 
202756 29992 


C-Ideal-2 202766 85420 31749 | 30742 


PRR-2 | 


23976 
92857 
Ppire2 | 207812 | 94855 | 6651 | 56341 
21687 
DOZTAR D254 


Table C.38: MATMUL(16) (coarse, cached) — Running Times (cycles). 


| 
| 
| 
| 
PRR-1 || 202748 23556 
| 
| 
| 
| 
I 


| p (number of processors) 
a SS 


[Free-Ideal | 1506780 | 522831 | 151081 | 55353 | 33011 | 
PStat____|[_1596780_[ 475054 [130842 [49758 [29680 | 
P-Ideal___|[_1596780_[ 996878 [285914 [99288 [52622 | 
 Grldeal-t_|[_ 1596798 [519578 [154498 | 106256 | 38645 | 
[ Cxldeal-2 || 1596798 
PRR-1 || 1596780_[ 524739 [153245 [6261 | 62336 | 

| 

| 

| 

| 

| 


PRR-2 [1596780 
525900 


pir2 | 


Table C.39: MATMUL(32) (coarse, cached) — Running Times (cycles). 
| p (number of processors)
[Met] 4] | 4 [6] 0a | 
[Free-Ideal [| 2077708 | _soi2045 | 97sa07 | osiid | Tisorl | 78707 | 
[Stat_—_|| errr [3454014 [930719 [_ 272724 [101960 | 68808 | 
[P-Tdeal___|[ 2677708 [~Tioased_| 900267 [559206 [212970 | 127776 | 
[ C-ldeal-1_|[ 2677726_[ 3626061 | 98isi1 [_srf2ad [T2416 | 91185 | 
[ C-ldeal-2__|[ 1267rr26_[ 3626236 | 97s3a1 [_s87iid [127164 | 9467 | 
PRR || Brrr [3614334 [979545 [_a0Isa [154241 | 189835 | 

2677083644334 [980520 | s0r53¢ [178596 [180722 | 


PRR-2 | 
PDirk1____|[ 12905516 | 14956668 | 4150412 | 2Ar00r1 | Te90097 | T4s%242 | 
PDire2 || 12905516 [3962096 | 2ro3620_|_98s793 | _s27H95 | 490766 | 
PTTM___|[ 12677708 [3647009 |_979384_|_—s01944 | 129253 | 98591 | 
PXTM___|[ 1677708 [3643966 [980768 | s02552 | Ia2229 | T0946 | 


Table C.40: MATMUL(64) (coarse, cached) — Running Times (cycles). 
C.5.2 Variable t_n


| p (number of processors) 

| Mer Tt TAT | 1024 
PFree-Ideal_[[ 12077708 | soaTeer | 1iidsoz | 350008 | 171385 | 143073 | 
Stat || 12077708 | s0a2013 | t02srrs | 323144 | 137551 | 109061 | 
[ P-Ideal_—_|[207rros | _Tarso1 | piziass | _orais2 | 900156 | 210318 | 
[C-ldealt_|[ 12077726_| —anostaT_| T109eRG_| 5285 | _TanoTR | TTR | 
[Crldeal-2_|[ 12077726 | —anos00_| Trees | TadR_| TITAS | _—ToT2T | 


PRR-__|[ 120rr7s_[_a97isr7 | 1480 [37ers [222136 | 296526 | 
PRR-2____|[120rr7s_[_ao7isrr | roriei [381229 [279793 | 309723 | 
P Dire || 12905516 | Tost2535 | 49sa120 | arssri7_| 2308575 | 2803854 _| 
Pitta || 12905516 [_aerrarr_| a5ris91_| Todt [798020 | 1038068 | 
PTTM [12677708 [4003598 | Triss00 [38031 [183951 | 160732 | 
PXTM___|[ 12677708 [4003087 | 115895 [386395 [195183 [173381 _| 


Table C.41: MATMUL(64) (coarse, cached) — t_n = 2 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
[Me Tit 4] Ss] CO 
[Free-Ideal || 1oorrrai | ae71729 | Tae1i4s | 500790 | 269279 | 246655 | 
PStat___|[taerr7s_[_aorres7 | 1212061 [423626 [209041 | 192405 | 
[P-Ideal__|[ r2err7ia_[_ssz02r1 | a5rarai_| Te2s74 | 821318 | 623514 | 
[ Cldealet_|[ 12677762 | 4663054 | 1393226 [951210 | 298082 | 296650 | 
[ Crldeal-2_|[ 12677762_|_a47airs5 | 2552824 [530808 | 303810 | 355427 _| 
PRR | vera |_—araa0a7_[ 1375338 [sais | 364024 [524772 | 
PRR-2 | 10rd |_ara0a7_| 1373995 |_s3ean | _A77251 | 651379 | 
D Dire || 12905522_[ 20813250_| 6875081_| 4379999 _| 3537360 | 3922007 | 
PDire2 || 12905522 -[_soxera5 | s5s7104 | 1rs440r_| 1402090 | 1982218 | 
PTTM || 12077826_[_—aes7756 | Taseso7 | ss1672 [299655 | 280959 _| 
PXTM___|[ 12677836 | aee2078 | Ts9007e [532596 | S17R45 | 283898 | 


Table C.42: MATMUL(64) (coarse, cached) — t_n = 4 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
[Me TCC 
[Free-Tdeal || 12677786 | savers’ | 1as5001 | Tor24 | 473356 | 489304 | 
[Stat || 12errree_[_a909247 | 1573680 | _e22191 [352013 | 357029 | 
[ P-Ideal___|[12677780_[ 19367269 | —Go1r909_| 426264 | 2491020 | 2045032 | 
[ Grldealt_|[ 12677798 | soa5105 [1982253 |_ss61a1 [527621 | 542206 | 
[ Crldeal-2_|[12677798_[ 5961410 | _1918050_| 1401302 |_s02816 | 544580_| 
PRR-1 || 12077786_[_ 6181500 | _1908260_[_ 839083 [678275 | 1076066 | 


PRR-2 | 126rrre6_[_oisis00 | 191874 [852106 | s16666 | 1197341 _| 
Pitti || 12905508 [30206673 | Tisese7d | Birs276 | 822981 | TOIOT494 | 
PDit-2 || 12905508 [9620614 | sossi7o | 2803742 | aed4984 | 4614958 | 
PTTM || 12677916 | s093536 | —r9Ts809_[_ 838308 | _520080 | 43875 | 
XTM___|[ 12677926 sis9088 [1931450 [838568 [529658 | 575220 | 


Table C.43: MATMUL(64) (coarse, cached) — t_n = 8 Cycles / Flit-Hop — Running Times (cycles).
C.6 MATMUL: Fine, Cached


C.6.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors) 
Mei, 4] 6] a 


30878 
206588 28785 | 18080 
28605 
206606 30552 


C-Ideal-2 206606 87614 | 46266 | 30994 


PRR2 | 


35718 
62538 
Ppire2 | 211181_| 106361 | 59980 | 4641 
36879 
205558 37482 


Table C.44: MATMUL(16) (fine, cached) — Running Times (cycles).


| 
| 
| 
| 
PRR-1 || 206588 32049 
| 
| 


| p (number of processors) 
a 
| 1612140 | 


[Free-Ideal || 1612140 
PStat___|[ 1612140 | 4sai80_[ 144385 [54702 [36922 | 
P-Ideal__|[1612140_[ 530486 | 244005 [95223 [51175 | 
| Grldealt_|[ 1672158 [527529 
[Cxtdeal-2 | 1612158 [527406 [207535 [83619 | 63370 | 
PRR || 1612140_[~ 54513 [ 192097 [87192 |_73708 | 

| 

| 

| 

| 

| 


PRR2 
T6417 


pir2 | 


Table C.45: MATMUL(32) (fine, cached) — Running Times (cycles). 
| p (number of processors)
| CL 
[Free-Ideal [| 12739148 | a8aiei5 | 1120468 | 408020 | 215175 | 188500 | 
PStat_—_—|| i27a9148 [385580 | 94720 [283918 [112888 | 83608 | 
P-Ideal__|[ 1273 9148_[ 3673909 _| 1532039 [437126_[ T71883_| T2512 | 
[ Cldealt_|[12739166_[ 3657363 | 1136724 [309822 | TTOOTT | 136810 | 
[ Caldeal-2 || 12739166 | 3657230_| Iarsrrd [388444 | T8546T_[ 143820 | 


PRR [12739148 | sresi9d [1131171 | 389605 [193748 [232218 

i2rso148_[ 3763194 | 1130461 | 403760 | 233886 | 265457 | 
Die |[ 12968121 | 7534645 | 2246912 | Totter | 615466_| 577223 | 
PDie2 | 12968121 [4250445 [149874 | 566448 | 278983 | 338331 | 
PTTM | 12739148 [ 3810409 | 1157229 | 412028 | 209896 | 192461 | 
PXTM____|[ 12739148 [ sstois7 | 1144804 [406853 [214161 | 211595 | 


Table C.46: MATMUL(64) (fine, cached) — Running Times (cycles). 
C.6.2 Variable t_n


| p (number of processors) 

[Mer TT] | 1024 | 
PFree-Ideal | 12730108 | qs0700 | 1isisi5 | S071 | Sil0d7 | 395031 | 
[Stat [12739148 | 3093009 | 1oa2s07 | 33749 | 149an1_| 12497 | 
[P-Ideal | i27so14e | a0ss2a2 | 1802870 | —s9908s | 351095 _| 195476 | 
[C-ldeakt_|[1ars01e_| dosae2i | 1s0rToT | —so0o6 | Based | BOSIRS | 
[C-Tdeal-2_|[13Tsmi6e_| aoss92 | Ta9ss7a | —siaas | Bssi7d | aG00T | 


PRR [12739148 [4291014 [1376186 | 499651 | 28547 | 385371 | 

i2ra148_ [189308 | Tar7e7s [582338 | 399629 | 48087 | 
P DiI |[ 12968121 | s455841 [2595856 | 1478620_[ S1A83T | STISIO 
PDie2 | 12968121 | sisea50_[ 195917 |_S19888 | 485674 | 434751 | 
PTTM | 12739148 | 4243316 | 1469110 | 613282 | 32196 _| 292TH | 
PXTM____|[ 12739148 | 4oodse5 [rriod4s [597103 | 334786 | 396301 | 


Table C.47: MATMUL(64) (fine, cached) — t_n = 2 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
[Me Tt, 4, se | ak |e] 10a | 
[Free-Ideal || 127s91d |_paroess | 1930592 | 919830 | 595148 | 582113 | 
[Stat___|[ 1a7s9rrd_[_t10so28 | 1220619 | _aa6145 | 222657 | 207695 | 
P-Ideal__|[ 127s9184_[_a751970_| 2529022 |_s22453 | Aaz767 | 345164 | 
[ G-ldeal-t_|[ 12739202 | 4694431 | 1875179 |_737853 | 40737 | 367605 | 
[ C-ldeal-2_|[12739202_[_s107929 | 1899864 |_S24117 | 474256 | 460087 _| 
PRR | 1273017 | s0sd938 | irea2o1 | 7380ed | a61916 | 679077 | 

[i2ra0rrd | 5105081_[ 1887344 | 9rd65 [716178 | 034937 | 
Die | 12968147 | ross6rs1_| 3949960 | 1964292 | T118369_| 1711399 _| 
PDire2 | 126817 [7235875 | 294561 | Ta99re7 |_soa7t9 | BB3TT4 | 
PTTM || 12739266_[_s145965_| 2007646 |_938104 | 5230968 [502558 | 
PXTM__|[127s9276-[_sost0rs | 1936568 [904656 [_s7irs6 | 594734 _| 


Table C.48: MATMUL(64) (fine, cached) — t_n = 4 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
| 
[Free-Tdeal [[ 12739226 | 724817 | 2820581 | 1573203 | 1097S06 | T2149 | 
[Stat____|[ 12739226 4940097 | 1592199 [_ 636031 [368665 | 373035 | 
[P-Tdeal___|[12739220_[ 1276943 | 3474505 [84856 [942091 [679288 | 
[C-ldeal-t_|[ 12739238 [076809 | 2748770 | 11st658 | _r11988 | _67A0TT | 
[C-ldeal-2__|[ 12739238 [_67si27s_| 2811055 | 1413270 | 920755 | 95578 | 
PRR-1____|[ 12739226 [_orss202 | 2541415 | 1089071 | _S05q37 | 1234849 | 


PRR-2 || 12739226 [6097210 [3046785 | 1529806 | 1432862 | DI7H518 | 
PDif=1 || 12968103_[ 15091855 | 6980501 | 4039756 | 2802365 | 3605615 | 
Din-2 || 12968103 [10664163 | sea5a15 | sisrir2 | 2180743 | 2089678 | 
PTTM || 12739356_[_ 6951140_| oq2752 [149160 |_953380_| 96572 | 
PxXTM || 12739366 [o876345 | -330139_[Tsz071 [980527 | 1002713 | 


Table C.49: MATMUL(64) (fine, cached) — t_n = 8 Cycles / Flit-Hop — Running Times (cycles).
C.7 MATMUL: Coarse, Uncached


C.7.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors) 
| 
| 34179 | 


34179 


[same 
[352508-| 468542 [157456 [65735 | 
: | 352526 | 256062 
 CTdeal-2__|[ 352526 | 256080 | 136129 [61873 | 
PRR-__|[-352508 | 256350 
PRR-2 [352508 | 256350 | 82216 | 39956 | 
P Dire |[-359550_| 920088 | 320679 | B679TT | 

[359550 [ 278965 | 167rit_| 122882 | 

[352508 [255403 [83489 [40247 | 

[352508 [255274 


pire | 


Table C.50: MATMUL(16) (coarse, uncached) — Running Times (cycles). 


| p (number of processors) 
p Manes = ill. i el — 16] 64 — = 25 
[Free-Ideal [| 2719540 | 1854235 | sasa22 | 177033 | 80100 | 
Stat] 2719340 | 1265819 [416058 | 130358 | 01582 | 
[P-Ideal_—_|[ 2719540 | 3518236 | tini024 | s74a00 | 105215 | 
 C-ldeal-t_|[ 2719358 | 1911602 | oai97s | _s0s277 | —124840_| 
[Caldeal-2 2719358 | 1911692 | _oa1ToO | —s188OR | 137920 | 
PRR=1 || 27isaa0_| toriae5_[—sea000_|ToraT0| 119267 | 
[_27i9sa0_[r9rr665 | —sex283 | 191270 | 14610 


PRR2_| 
Di | 2768358_[o9s7270_| 2382532 | Ts30d27_| T5585 | 
PDi-2 || -2768358_[ 207s989 | Tro216 | 396603 | 346164 | 
PTTM | 2719340 rori289 [563301 [190608 | 95920 
PXTM____|[ 2719340[ 1909555 [562160 | 194963 | 103078 | 


Table C.51: MATMUL(32) (coarse, uncached) — Running Times (cycles). 
| p (number of processors)
[Mer TT 286 | 1024 | 
PFree-Ideal [| 21501461 | 1ao100s8 | q0sed72 [1959580 [ 405082 | D5I5I0 | 
[Stat] aisoiaed|_ooo107e | sis9039 | aoo109 | a2aa50| 171770_] 
[P-Tdeal__|[ orseiaad_[ora00058 | —asodTa0_| —Bagad6d_|—99RIRD_| ATTIT 
Crldeakt_|[ oiseis02_[ 1aeiziTe | Tradiai_| —20Rne1d | —GoneTI | 29277 
C-Tdeal-2_|[ aiaeis03_[ 1aeizire | Trasoad | —sav071d | —TaT0aT | 2eI211 | 
PRR | orseiaad_[iaeizion | _aaaaeoT | _taaae22_|—s92180_| T0080 
[2iseiasa_[ iisizizs | ao3ieo7 | 1343022 | 514905 | 421980 | 


PRR2 | 
P Diet ___|[ 21745016 [53908520 | Tre2sie1_| Ti00sd95 | B519760 | —— | 
Ppie2 | 21745016 | ter09783 | s2s669 | 2925853 | Ieise27 | _—— | 
PTTM || 21361484 [14842060 | 4234487 | _1342960_| 502854 | 262820 | 
PXTM |] 21361484 [t4ss726 | _apstrox [1343355 | 530603 | 301307 | 


Table C.52: MATMUL(64) (coarse, uncached) — Running Times (cycles). 
C.7.2 Variable t_n


| p (number of processors) 

[Mer TE TTT 856 T1024 | 
PFree-Ideal [ 2is0101 | Dasiso7e | orTaio1 | 2116700 | S03571 | 500830 | 
Stat JP 2525444 | 1d4asid9_| sooreai | isri0ad | —sasazi | 304003 | 
[ P-Ideal__[ 2isas444 | aanasaa7_| tanaora7 | aoorsad | 1raoii4 | 881951 | 
C-Tdeakt || 21szsa02_| 2isasoas | Toorars | sa101Te | T1A6G7a_| 52dR08 | 
[C-Tdeal-2_|[ 21szsa02_| discon _| T2aroaes | —sa9n%ed | T2aTs0a | SaTOT | 


PRR-1 || 21525444 [24326392 | Tosorrs [2326224 | 903004 | T6027S | 
PRR-2___|[ 21525444 [24326392 |_TosoTTs | 2341043 | 91RD | 781196 | 
DiI | 21917988 | sss6sr37 | 31213325 | 2019067 | TH96IOIT | —— | 
PDie2 | 21917988 | 28054956 | Ts0533 | 4992768 | DRBSoIS |__| 
PTTM || 2125444 [ 2asisre2 | 7114661 | 2325380 | 901946 | 560805 | 
PXTM___|[ 21525444 [243rs261 [7098634 [2325622 | 902170 | 505961 | 


Table C.53: MATMUL(64) (coarse, uncached) — t_n = 2 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
[Me tt _4, <6] —SA' — | __—i| 
[Free-Ideal [[ 21853300 | 42734468 | 2080289 | seas56q | [520278 | 9IS149 | 
PStat__—__|f 21853300 [2352545 | aedaa7s [2782885 [988735 [570531 | 
P-Tdeal_—_|[ 21853400 [77056037 _[ 25361019 | 15508979 | 6644712 | 3316380 | 
[ C-ideal-1__|[ 21853418 [—43277393_[ 12426237 [6002250 | 1956558 | 1037355 | 
[ C-Tdeal-2__|[ 21853418 [~43292853_[ T282707s | 6326226 | _Ir14143_| T0OTTOR | 


PRR-T___|[ 21853390 [43277928 | Teeari77_|_4291610 | _1r02392 | 1474251 _| 
PRR-2 || 21853390 | 45277928 | 2858605 | 4291472 | _Tr10295 | 1473729 | 
PDirki___|[ 22205878 | Toaasiso1 | esorsrsr | 44780797 | s2965603 | _— — | 
Dinka || 22245878 | 4as8ds48 | 2oszsei1 | 104e0s6s | 7258197 | __—— | 
PTTM || 21853482 [45273046 | 12826000_[ 4290254 | 101284 | _O5RTSS | 
PXTM___|[- 21853492 [~as202710_ | 12826406 [4290516 [_1ro1680_| 1004857 | 


Table C.54: MATMUL(64) (coarse, uncached) — t_n = 4 Cycles / Flit-Hop — Running Times (cycles).
| p (number of processors)
[Me Tit.) CCC 
[Free-Ideal [| 22509282 | 81192208 | D98Ir227 | 7416160 | 2955165 | I8R4S19 | 
[Stat || 22509282 [_4irosss7 [15919325 | _5205121 | 1874563 | 1102101 | 
[ P-Ideal___|[ 22509276 | a4ra7ii23 [91569701 | 61080065 | 25048328 | 12866223 | 
[ CrIdeakt_|[ 22509204 | sii9erid | 24onzi07 [__S220647 | 4417729 | 2019579 | 
[ CrTdeal-2 || 22509204 [81194822 [21982150 | 8220403 | 3300310 | 1993246 | 
PRR-1 || 22509282_[_srrso73e [21983314 [8222848 [3324033 | 2829060 _| 

[22500282 [siis9238 [24283922 | 8292256 | _a355445 | _S0304T | 


PRR2_| 
PDT | 22014 102_[ 385614536 | Tar2raa11 | Tors50359 | e2e0267a | —— | 
PDir2 | 22014 102[ rese21352 | ris01656 | s6ss457s | Tsez247 | —— | 
PTTM || 22509412 | si187665 | 24281540 | S210678 | 9200472 | TOOBRTE | 
PXTM____|[ 2250942 [_sii95257 [24287956 [8220220 [3300050 | 2055491 | 


Table C.55: MATMUL(64) (coarse, uncached) — t_n = 8 Cycles / Flit-Hop — Running Times (cycles).
C.8 MATMUL: Fine, Uncached


C.8.1 t_n = 1 Cycle / Flit-Hop


| p (number of processors) 
a 


Free-Ideal 


[370084 [250140 | 87530 | 41108 | 
[37058 [177405 [65854 [33010 | 
P-Ideal___|[-370684_| 269771 | 122041 | 43855 | 
[ CTdeal-T_|[ 370702 | 270194 
| CoTdeal-2_|[370702 | 270194 [93190 | 46376 | 
PRR | 37084235411 [92211 [47934 | 
PRR-2 | 370684 [235411 [91800 |_5I7a | 
P Diet| s77sis | a420%3 | 194642 | 129980 | 
PDie2 | s7rsts [351660 | 121258 |_ 68191 | 
PTTM | 3706s4_| 307706 |_ 9387 [54221 | 
PXTM [370684 | 256891 


Table C.56: MATMUL(16) (fine, uncached) — Running Times (cycles). 


[Me _T 1] 4, Ss] 
Free-Ideal || 270201 | 1890956 | 579679 | 192056 | 95285 | 
[Stat____|[ 2792044 [1303664 [437048 | 149906 | 78204 | 
P-Ideal__|[ 2792044 [1965900_[_837430 | 211729 | Tor7s2 | 
 Grldealet_|[2792062_[T9s7s41_[ 591216 | 197795 | 102040 | 
[ Ctdeal-2_|[ 2792062 [ T967e41_[ 582517 | 200821 | 99786 | 
PRR-1___|[ 2792041 [1687823 [586390 | 199798 | T16810_| 


| p (number of processors) 


PRR-2 [| 2792014 [1687823 [586169 | 207368 | 129038 | 
Dine 1 || 2842495325583 | Tar5q0r_| 98135 | 322556 | 
[Dire2 || 2842495 [arrest [782205 | 28013ad | 152494 | 
PTT || 2792014189667 [_s8i469 | 206778 | 112407 | 
PxTM || 2792001896108 [575366 | 206040 | 117651 | 


Table C.57: MATMUL(32) (fine, uncached) — Running Times (cycles).
| p (number of processors)
[Me i, 4] | 4 [ 6] 0a | 
[Free-Ideal [[ 21652300 | 14761832 | ts1007S | 128209 | a54I22 | 2518 | 
[Stat_—__f[ 21652300 [ 10051914 | a26m1a1 [1012307 | 360160 | 207622 | 
[P-Tdeal___|[ 21652300_[ 15057941 | 6120650 | 1385310 |_S10T91 | 248728 | 
[ Caldeal-T_|[ 21652318 [15056041 | 4305572 [1279048 | —A57I70_| 263063 | 
[ C-Tdeal-2__|[ 21652318 [15066041 | 4275520_[ 1302630 [—a56183 | 255201 | 
PRR-1 || 21652300_[ 1830193 | 4286631 [1265566 | 490230 [389978 | 

[21652300 T2830193_[-az7r495_| 1308846 [478205 [_s51470 | 


PRR-2 | 
PDire1___|[ 22082307 | 2687501 | 9870095 | 4527924 | To25214 | 1490259 _| 
Dpirt2 || 22042307 | aeisissd | soisors | 1rri2s9 | 755602 | 381221 | 
PTTM || 21652300_[14rrisrs | 4205281 | 1325285 | _A7e531 | 295840 | 
PXTM___|[-21652300[ etait _|-4rs3621 | 1386063 [509640 [293876 | 


Table C.58: MATMUL(64) (fine, uncached) — Running Times (cycles). 
C.8.2 Variable t_n


| p (number of processors) 

Pp Mer TE ET] 1024 | 
FFree-Ideal [[ 21980100 | Das00s30 | —oss0050 | DisvaIs | T7212 | 128300 | 
Stat [| 21980100 | idssidis | 5134803 | 1045509 | 03019 | _300033_| 
P-Ideal— |] 21980100] 2ico2R09 | Todoosii_| aioisa1_| —a7s255_| —aadotT | 
[C-ldeakt_|[ aiosoris | 2i70%s19_| _soondT6_| 2a0zaad | _—TraDR | —AHOIRT 
[Crldeal-2_|[ aioeoris | 2170004 | —Tao0d0T | 214967) | —TaRSIT | —AATTSG_ 


PRR-1 || 21980100_[ 24709900 [7033305 | 2179B51 | 830849 [744295 | 
PRR-2 || 21980100_[ 19942596 [706609 | 242732 [816729 [735381 _| 
DiI ___|[ 22375593 [30955831 | Tr4760s | 7744693 | 3928718 | 2728089 _| 
PDie2 || 22875593 | s20sers3 | 10217s94 | siea637 | 1291874 | 665254 | 
PTTM || 21980100_[ 20236650 | 7030368 | 2216803 | S2728 | —AR5835 | 
-XTM |] 21980100_[ 20236722 [7113994 | 2927559 | _B56148 | _ARDOT5 | 


Table C.59: MATMUL(64) (fine, uncached) — t_n = 2 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
[Me Tit 4#t SCC 
[Free-Ideal [| 22635726 | sinase7s | 12e26214 | s907s02 | 1450590 | 776006 | 
[Stat || 22635726 | 28040587 |_ 8866307 | 2911965 | 1088950 | 663265 | 
[ P-Ideal || 22635736 | 43962080 | 18526642 | _5067859_| 1691956 | 850316 | 
[ CrIdear-T_|[ 22635764 | 4964084 | 12670352 | _s913197 | 1434321 | 820946 | 
 CrIdeal-2_ || 22635764 | 4996199 | 13600879 | 4023870 | 1453164 [824704 | 
PRR-T || 22635726_[ aa967713 | 12670904 | 3980386 | 1543615 | 1420381 | 
PRR-2 || 22635726 | qa96r713 | 12738699 | q0s7z18 | 1656624 | I606TTS | 
Di |[ 23043277 | roait543 | 35091067 | 12392624 | 6762955 | 1680617 | 
PDir2 | 23043277 | orors1n1 | 2aez57s | 6749695 | 27s1156 | T4721 | 
PTTM || 22635818 [35050760 | 12480330 [3983821 | 1595955 [855220 
PXTM___ | 22635828 [34460805 | 12803042 | aaoR527 | I58RISd | BO2DI5 | 


Table C.60: MATMUL(64) (fine, uncached) — t_n = 4 Cycles / Flit-Hop — Running Times (cycles).


| p (number of processors)
[Me COCO 
[Free-Ideal [[ 23916078 | S3584606 | 28243558 | Tal28a4 | 2729209 [| 1490544 | 
[Stat____|[ 23046078 [42458765 [10320195 |_54aa7a7 [2060857 | _12T06IT | 
P-Tdeal || 23946972 | Tisoe2rat [49400278 | Srs55r1 | _3687700 | 1955760 
| C-Tdeal-1__|[ 23916900_[_ 82568456_[ 25588616 | 7530083 | 2688134 | 1510855 | 
| C-Ideal-2 || 23946900_[_63d98041 [24806013 | 7494771 [2728884 [_I546171 | 
PRR-1 || 23946978 [62516554 [23005915 | Tanai [2969406 | 2801732 | 


PRR-2 || 23946978 [__s2569647 [24360516 | 7609020 | _s107I73 | _2968417 | 
PDife1 || 24377507 _[Tos325301 | T1o036655 | 32052696 | 14765763 | 1118528 
Dina || 27507 [iaase0ra9 | a5305085 | 16573886 | r476504 | 3802137 | 
PTT M || 23047108 | 64678465_[ 28483204 [8317440 [2800915 | T5R285 | 
PxXTM___|[2304ri1s | eaerrris_[_2a7rese7_[ 8562822 [3086301 [1655296 | 


Table C.61: MATMUL(64) (fine, uncached) — t_n = 8 Cycles / Flit-Hop — Running Times (cycles).